class-wp-html-tag-processor.php
<?php
/**
 * HTML API: WP_HTML_Tag_Processor class
 *
 * Scans through an HTML document to find specific tags, then
 * transforms those tags by adding, removing, or updating the
 * values of the HTML attributes within that tag (opener).
 *
 * Does not fully parse HTML or _recurse_ into the HTML structure.
 * Instead this scans linearly through a document and only parses
 * the HTML tag openers.
 *
 * ### Possible future direction for this module
 *
 *  - Prune the whitespace when removing classes/attributes: e.g. "a b c" -> "c" not " c".
 *    This would increase the size of the changes for some operations but leave more
 *    natural-looking output HTML.
 *
 * @package WordPress
 * @subpackage HTML-API
 * @since 6.2.0
 */

/**
 * Core class used to modify attributes in an HTML document for tags matching a query.
 *
 * ## Usage
 *
 * Use of this class requires three steps:
 *
 *  1. Create a new class instance with your input HTML document.
 *  2. Find the tag(s) you are looking for.
 *  3. Request changes to the attributes in those tag(s).
 *
 * Example:
 *
 *     $tags = new WP_HTML_Tag_Processor( $html );
 *     if ( $tags->next_tag( 'option' ) ) {
 *         $tags->set_attribute( 'selected', true );
 *     }
 *
 * ### Finding tags
 *
 * The `next_tag()` function moves the internal cursor through
 * your input HTML document until it finds a tag meeting any of
 * the supplied restrictions in the optional query argument. If
 * no argument is provided then it will find the next HTML tag,
 * regardless of what kind it is.
 *
 * If you want to _find whatever the next tag is_:
 *
 *     $tags->next_tag();
 *
 * | Goal | Query |
 * |-----------------------------------------------------------|---------------------------------------------------------------------------------|
 * | Find any tag. | `$tags->next_tag();` |
 * | Find next image tag. | `$tags->next_tag( array( 'tag_name' => 'img' ) );` |
 * | Find next image tag (without passing the array). | `$tags->next_tag( 'img' );` |
 * | Find next tag containing the `fullwidth` CSS class. | `$tags->next_tag( array( 'class_name' => 'fullwidth' ) );` |
 * | Find next image tag containing the `fullwidth` CSS class. | `$tags->next_tag( array( 'tag_name' => 'img', 'class_name' => 'fullwidth' ) );` |
 *
 * If a tag was found meeting your criteria then `next_tag()`
 * will return `true` and you can proceed to modify it. If it
 * returns `false`, however, it failed to find the tag and
 * moved the cursor to the end of the file.
 *
 * Once the cursor reaches the end of the file the processor
 * is done and if you want to reach an earlier tag you will
 * need to recreate the processor and start over, as it's
 * unable to back up or move in reverse.
 *
 * See the section on bookmarks for an exception to this
 * no-backing-up rule.
 *
 * #### Custom queries
 *
 * Sometimes it's necessary to inspect an HTML tag further than
 * the query syntax here permits. In these cases one may further
 * inspect the search results using the read-only functions
 * provided by the processor or external state or variables.
 *
 * Example:
 *
 *     // Paint up to the first five DIV or SPAN tags marked with the "jazzy" style.
 *     $remaining_count = 5;
 *     while ( $remaining_count > 0 && $tags->next_tag() ) {
 *         if (
 *              ( 'DIV' === $tags->get_tag() || 'SPAN' === $tags->get_tag() ) &&
 *              'jazzy' === $tags->get_attribute( 'data-style' )
 *         ) {
 *             $tags->add_class( 'theme-style-everest-jazz' );
 *             $remaining_count--;
 *         }
 *     }
 *
 * `get_attribute()` will return `null` if the attribute wasn't present
 * on the tag when it was called. It may return `""` (the empty string)
 * in cases where the attribute was present but its value was empty.
 * For boolean attributes, those whose name is present but no value is
 * given, it will return `true` (the only way to set `false` for an
 * attribute is to remove it).
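 *
 * A short sketch of those cases, assuming an INPUT tag written only for
 * this illustration and a loaded WordPress environment:
 *
 *     $p = new WP_HTML_Tag_Processor( '<input type="text" value="" disabled>' );
 *     $p->next_tag();
 *     'text' === $p->get_attribute( 'type' );     // Present with a value.
 *     ''     === $p->get_attribute( 'value' );    // Present but empty.
 *     true   === $p->get_attribute( 'disabled' ); // Boolean attribute.
 *     null   === $p->get_attribute( 'name' );     // Not present on the tag.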
 *
 * #### When matching fails
 *
 * When `next_tag()` returns `false` it could mean different things:
 *
 *  - The requested tag wasn't found in the input document.
 *  - The input document ended in the middle of an HTML syntax element.
 *
 * When a document ends in the middle of a syntax element it will pause
 * the processor. This is to make it possible in the future to extend the
 * input document and proceed - an important requirement for chunked
 * streaming parsing of a document.
 *
 * Example:
 *
 *     $processor = new WP_HTML_Tag_Processor( 'This <div is="a" partial="token' );
 *     false === $processor->next_tag();
 *
 * If a special element (see next section) is encountered but no closing tag
 * is found it will count as an incomplete tag. The parser will pause as if
 * the opening tag were incomplete.
 *
 * Example:
 *
 *     $processor = new WP_HTML_Tag_Processor( '<style>// there could be more styling to come' );
 *     false === $processor->next_tag();
 *
 *     $processor = new WP_HTML_Tag_Processor( '<style>// this is everything</style><div>' );
 *     true === $processor->next_tag( 'DIV' );
 *
 * #### Special self-contained elements
 *
 * Some HTML elements are handled in a special way; their start and end tags
 * act like a void tag. These are special because their contents can't contain
 * HTML markup. Everything inside these elements is handled in a special way
 * and content that _appears_ like HTML tags inside of them isn't. There can
 * be no nesting in these elements.
 *
 * In the following list, "raw text" means that all of the content in the HTML
 * until the matching closing tag is treated verbatim without any replacements
 * and without any parsing.
 *
 *  - IFRAME allows no content but requires a closing tag.
 *  - NOEMBED (deprecated) content is raw text.
 *  - NOFRAMES (deprecated) content is raw text.
 *  - SCRIPT content is plaintext apart from legacy rules allowing `</script>` inside an HTML comment.
 *  - STYLE content is raw text.
 *  - TITLE content is plain text but character references are decoded.
 *  - TEXTAREA content is plain text but character references are decoded.
 *  - XMP (deprecated) content is raw text.
 *
 * ### Modifying HTML attributes for a found tag
 *
 * Once you've found the start of an opening tag you can modify
 * any number of the attributes on that tag. You can set a new
 * value for an attribute, remove the entire attribute, or do
 * nothing and move on to the next opening tag.
 *
 * Example:
 *
 *     if ( $tags->next_tag( array( 'class_name' => 'wp-group-block' ) ) ) {
 *         $tags->set_attribute( 'title', 'This groups the contained content.' );
 *         $tags->remove_attribute( 'data-test-id' );
 *     }
 *
 * If `set_attribute()` is called for an existing attribute it will
 * overwrite the existing value. Similarly, calling `remove_attribute()`
 * for a non-existing attribute has no effect on the document. Both
 * of these methods are safe to call without knowing if a given attribute
 * exists beforehand.
 *
 * ### Modifying CSS classes for a found tag
 *
 * The tag processor treats the `class` attribute as a special case.
 * Because it's a common operation to add or remove CSS classes, this
 * interface adds helper methods to make that easier.
 *
 * As with attribute values, adding or removing CSS classes is a safe
 * operation that doesn't require checking if the attribute or class
 * exists before making changes. If removing the only class then the
 * entire `class` attribute will be removed.
 *
 * Example:
 *
 *     // from `<span>Yippee!</span>`
 *     //   to `<span class="is-active">Yippee!</span>`
 *     $tags->add_class( 'is-active' );
 *
 *     // from `<span class="excited">Yippee!</span>`
 *     //   to `<span class="excited is-active">Yippee!</span>`
 *     $tags->add_class( 'is-active' );
 *
 *     // from `<span class="is-active heavy-accent">Yippee!</span>`
 *     //   to `<span class="is-active heavy-accent">Yippee!</span>`
 *     $tags->add_class( 'is-active' );
 *
 *     // from `<input type="text" class="is-active rugby not-disabled" length="24">`
 *     //   to `<input type="text" class="is-active not-disabled" length="24">`
 *     $tags->remove_class( 'rugby' );
 *
 *     // from `<input type="text" class="rugby" length="24">`
 *     //   to `<input type="text" length="24">`
 *     $tags->remove_class( 'rugby' );
 *
 *     // from `<input type="text" length="24">`
 *     //   to `<input type="text" length="24">`
 *     $tags->remove_class( 'rugby' );
 *
 * When class changes are enqueued but a direct change to `class` is made via
 * `set_attribute` then the changes to `set_attribute` (or `remove_attribute`)
 * will take precedence over those made through `add_class` and `remove_class`.
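 *
 * A sketch of that precedence rule, assuming placeholder class names and a
 * loaded WordPress environment:
 *
 *     $p = new WP_HTML_Tag_Processor( '<div class="card">' );
 *     $p->next_tag();
 *     $p->add_class( 'is-active' );           // Enqueued class change.
 *     $p->set_attribute( 'class', 'banner' ); // Direct change takes precedence.
 *     // Per the rule above, the result is expected to carry only class="banner".
 *     $html = $p->get_updated_html();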
213 *
214 * ### Bookmarks
215 *
216 * While scanning through the input HTMl document it's possible to set
217 * a named bookmark when a particular tag is found. Later on, after
218 * continuing to scan other tags, it's possible to `seek` to one of
219 * the set bookmarks and then proceed again from that point forward.
220 *
221 * Because bookmarks create processing overhead one should avoid
222 * creating too many of them. As a rule, create only bookmarks
223 * of known string literal names; avoid creating "mark_{$index}"
224 * and so on. It's fine from a performance standpoint to create a
225 * bookmark and update it frequently, such as within a loop.
226 *
227 * $total_todos = 0;
228 * while ( $p->next_tag( array( 'tag_name' => 'UL', 'class_name' => 'todo' ) ) ) {
229 * $p->set_bookmark( 'list-start' );
230 * while ( $p->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
231 * if ( 'UL' === $p->get_tag() && $p->is_tag_closer() ) {
232 * $p->set_bookmark( 'list-end' );
233 * $p->seek( 'list-start' );
234 * $p->set_attribute( 'data-contained-todos', (string) $total_todos );
235 * $total_todos = 0;
236 * $p->seek( 'list-end' );
237 * break;
238 * }
239 *
240 * if ( 'LI' === $p->get_tag() && ! $p->is_tag_closer() ) {
241 * $total_todos++;
242 * }
243 * }
244 * }
245 *
246 * ## Tokens and finer-grained processing.
247 *
248 * It's possible to scan through every lexical token in the
249 * HTML document using the `next_token()` function. This
250 * alternative form takes no argument and provides no built-in
251 * query syntax.
252 *
253 * Example:
254 *
255 * $title = '(untitled)';
256 * $text = '';
257 * while ( $processor->next_token() ) {
258 * switch ( $processor->get_token_name() ) {
259 * case '#text':
260 * $text .= $processor->get_modifiable_text();
261 * break;
262 *
263 * case 'BR':
264 * $text .= "\n";
265 * break;
266 *
267 * case 'TITLE':
268 * $title = $processor->get_modifiable_text();
269 * break;
270 * }
271 * }
272 * return trim( "# {$title}\n\n{$text}" );
273 *
274 * ### Tokens and _modifiable text_.
275 *
276 * #### Special "atomic" HTML elements.
277 *
278 * Not all HTML elements are able to contain other elements inside of them.
279 * For instance, the contents inside a TITLE element are plaintext (except
280 * that character references like &amp; will be decoded). This means that
281 * if the string `<img>` appears inside a TITLE element, then it's not an
282 * image tag, but rather it's text describing an image tag. Likewise, the
283 * contents of a SCRIPT or STYLE element are handled entirely separately in
284 * a browser than the contents of other elements because they represent a
285 * different language than HTML.
286 *
287 * For these elements the Tag Processor treats the entire sequence as one,
288 * from the opening tag, including its contents, through its closing tag.
289 * This means that the it's not possible to match the closing tag for a
290 * SCRIPT element unless it's unexpected; the Tag Processor already matched
291 * it when it found the opening tag.
292 *
293 * The inner contents of these elements are that element's _modifiable text_.
294 *
295 * The special elements are:
296 * - `SCRIPT` whose contents are treated as raw plaintext but supports a legacy
297 * style of including JavaScript inside of HTML comments to avoid accidentally
298 * closing the SCRIPT from inside a JavaScript string. E.g. `console.log( '</script>' )`.
299 * - `TITLE` and `TEXTAREA` whose contents are treated as plaintext and then any
300 * character references are decoded. E.g. `1 &lt; 2 < 3` becomes `1 < 2 < 3`.
301 * - `IFRAME`, `NOSCRIPT`, `NOEMBED`, `NOFRAME`, `STYLE` whose contents are treated as
302 * raw plaintext and left as-is. E.g. `1 &lt; 2 < 3` remains `1 &lt; 2 < 3`.
303 *
304 * #### Other tokens with modifiable text.
305 *
306 * There are also non-elements which are void/self-closing in nature and contain
307 * modifiable text that is part of that individual syntax token itself.
308 *
309 * - `#text` nodes, whose entire token _is_ the modifiable text.
310 * - HTML comments and tokens that become comments due to some syntax error. The
311 * text for these tokens is the portion of the comment inside of the syntax.
312 * E.g. for `<!-- comment -->` the text is `" comment "` (note the spaces are included).
313 * - `CDATA` sections, whose text is the content inside of the section itself. E.g. for
314 * `<![CDATA[some content]]>` the text is `"some content"` (with restrictions [1]).
315 * - "Funky comments," which are a special case of invalid closing tags whose name is
316 * invalid. The text for these nodes is the text that a browser would transform into
317 * an HTML comment when parsing. E.g. for `</%post_author>` the text is `%post_author`.
318 * - `DOCTYPE` declarations like `<DOCTYPE html>` which have no closing tag.
319 * - XML Processing instruction nodes like `<?wp __( "Like" ); ?>` (with restrictions [2]).
320 * - The empty end tag `</>` which is ignored in the browser and DOM.
321 *
322 * [1]: There are no CDATA sections in HTML. When encountering `<![CDATA[`, everything
323 * until the next `>` becomes a bogus HTML comment, meaning there can be no CDATA
324 * section in an HTML document containing `>`. The Tag Processor will first find
325 * all valid and bogus HTML comments, and then if the comment _would_ have been a
326 * CDATA section _were they to exist_, it will indicate this as the type of comment.
327 *
328 * [2]: XML allows a broader range of characters in a processing instruction's target name
329 * and disallows "xml" as a name, since it's special. The Tag Processor only recognizes
330 * target names with an ASCII-representable subset of characters. It also exhibits the
331 * same constraint as with CDATA sections, in that `>` cannot exist within the token
332 * since Processing Instructions do no exist within HTML and their syntax transforms
333 * into a bogus comment in the DOM.
334 *
335 * ## Design and limitations
336 *
337 * The Tag Processor is designed to linearly scan HTML documents and tokenize
338 * HTML tags and their attributes. It's designed to do this as efficiently as
339 * possible without compromising parsing integrity. Therefore it will be
340 * slower than some methods of modifying HTML, such as those incorporating
341 * over-simplified PCRE patterns, but will not introduce the defects and
342 * failures that those methods bring in, which lead to broken page renders
343 * and often to security vulnerabilities. On the other hand, it will be faster
344 * than full-blown HTML parsers such as DOMDocument and use considerably
345 * less memory. It requires a negligible memory overhead, enough to consider
346 * it a zero-overhead system.
347 *
348 * The performance characteristics are maintained by avoiding tree construction
349 * and semantic cleanups which are specified in HTML5. Because of this, for
350 * example, it's not possible for the Tag Processor to associate any given
351 * opening tag with its corresponding closing tag, or to return the inner markup
352 * inside an element. Systems may be built on top of the Tag Processor to do
353 * this, but the Tag Processor is and should be constrained so it can remain an
354 * efficient, low-level, and reliable HTML scanner.
355 *
356 * The Tag Processor's design incorporates a "garbage-in-garbage-out" philosophy.
357 * HTML5 specifies that certain invalid content be transformed into different forms
358 * for display, such as removing null bytes from an input document and replacing
359 * invalid characters with the Unicode replacement character `U+FFFD` (visually "οΏ½").
360 * Where errors or transformations exist within the HTML5 specification, the Tag Processor
361 * leaves those invalid inputs untouched, passing them through to the final browser
362 * to handle. While this implies that certain operations will be non-spec-compliant,
363 * such as reading the value of an attribute with invalid content, it also preserves a
364 * simplicity and efficiency for handling those error cases.
365 *
366 * Most operations within the Tag Processor are designed to minimize the difference
367 * between an input and output document for any given change. For example, the
368 * `add_class` and `remove_class` methods preserve whitespace and the class ordering
369 * within the `class` attribute; and when encountering tags with duplicated attributes,
370 * the Tag Processor will leave those invalid duplicate attributes where they are but
371 * update the proper attribute which the browser will read for parsing its value. An
372 * exception to this rule is that all attribute updates store their values as
373 * double-quoted strings, meaning that attributes on input with single-quoted or
374 * unquoted values will appear in the output with double-quotes.
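 *
 * As a rough illustration of that exception, assuming an unquoted `src`
 * value on input (the file names are placeholders):
 *
 *     $p = new WP_HTML_Tag_Processor( '<img src=before.png>' );
 *     if ( $p->next_tag( 'img' ) ) {
 *         $p->set_attribute( 'src', 'after.png' );
 *     }
 *     // The updated attribute is written back double-quoted: src="after.png".
 *     $html = $p->get_updated_html();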
 *
 * ### Scripting Flag
 *
 * The Tag Processor parses HTML with the "scripting flag" disabled. This means
 * that it doesn't run any scripts while parsing the page. In a browser with
 * JavaScript enabled, for example, the script can change the parse of the
 * document as it loads. On the server, however, evaluating JavaScript is not
 * only impractical, but also unwanted.
 *
 * Practically this means that the Tag Processor will descend into NOSCRIPT
 * elements and process their child tags. Were the scripting flag enabled, such
 * as in a typical browser, the contents of NOSCRIPT are skipped entirely.
 *
 * This allows the HTML API to process the content that will be presented in
 * a browser when scripting is disabled, but it offers a different view of a
 * page than most browser sessions will experience. E.g. the tags inside the
 * NOSCRIPT disappear.
 *
 * ### Text Encoding
 *
 * The Tag Processor assumes that the input HTML document is encoded with a
 * text encoding compatible with 7-bit ASCII's '<', '>', '&', ';', '/', '=',
 * "'", '"', 'a' - 'z', 'A' - 'Z', and the whitespace characters ' ', tab,
 * carriage-return, newline, and form-feed.
 *
 * In practice, this includes almost every single-byte encoding as well as
 * UTF-8. Notably, however, it does not include UTF-16. If providing input
 * that's incompatible, then convert the encoding beforehand.
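 *
 * A minimal sketch of such a conversion, assuming the input is known to be
 * UTF-16 and the `mbstring` extension is available (`$utf16_html` is a
 * hypothetical variable holding the original document):
 *
 *     $html      = mb_convert_encoding( $utf16_html, 'UTF-8', 'UTF-16' );
 *     $processor = new WP_HTML_Tag_Processor( $html );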
 *
 * @since 6.2.0
 * @since 6.2.1 Fix: Support for various invalid comments; attribute updates are case-insensitive.
 * @since 6.3.2 Fix: Skip HTML-like content inside rawtext elements such as STYLE.
 * @since 6.5.0 Pauses processor when input ends in an incomplete syntax token.
 *              Introduces "special" elements which act like void elements, e.g. TITLE, STYLE.
 *              Allows scanning through all tokens and processing modifiable text, where applicable.
 */
class WP_HTML_Tag_Processor {
	/**
	 * The maximum number of bookmarks allowed to exist at
	 * any given time.
	 *
	 * @since 6.2.0
	 * @var int
	 *
	 * @see WP_HTML_Tag_Processor::set_bookmark()
	 */
	const MAX_BOOKMARKS = 10;

	/**
	 * Maximum number of times seek() can be called.
	 * Prevents accidental infinite loops.
	 *
	 * @since 6.2.0
	 * @var int
	 *
	 * @see WP_HTML_Tag_Processor::seek()
	 */
	const MAX_SEEK_OPS = 1000;

	/**
	 * The HTML document to parse.
	 *
	 * @since 6.2.0
	 * @var string
	 */
	protected $html;

	/**
	 * The last query passed to next_tag().
	 *
	 * @since 6.2.0
	 * @var array|null
	 */
	private $last_query;

	/**
	 * The tag name this processor currently scans for.
	 *
	 * @since 6.2.0
	 * @var string|null
	 */
	private $sought_tag_name;

	/**
	 * The CSS class name this processor currently scans for.
	 *
	 * @since 6.2.0
	 * @var string|null
	 */
	private $sought_class_name;

	/**
	 * The match offset this processor currently scans for.
	 *
	 * @since 6.2.0
	 * @var int|null
	 */
	private $sought_match_offset;

	/**
	 * Whether to visit tag closers, e.g. </div>, when walking an input document.
	 *
	 * @since 6.2.0
	 * @var bool
	 */
	private $stop_on_tag_closers;

	/**
	 * Specifies mode of operation of the parser at any given time.
	 *
	 * | State | Meaning |
	 * | ----------------|----------------------------------------------------------------------|
	 * | *Ready* | The parser is ready to run. |
	 * | *Complete* | There is nothing left to parse. |
	 * | *Incomplete* | The HTML ended in the middle of a token; nothing more can be parsed. |
	 * | *Matched tag* | Found an HTML tag; it's possible to modify its attributes. |
	 * | *Text node* | Found a #text node; this is plaintext and modifiable. |
	 * | *CDATA node* | Found a CDATA section; this is modifiable. |
	 * | *Comment* | Found a comment or bogus comment; this is modifiable. |
	 * | *Presumptuous* | Found an empty tag closer: `</>`. |
	 * | *Funky comment* | Found a tag closer with an invalid tag name; this is modifiable. |
	 *
	 * @since 6.5.0
	 *
	 * @see WP_HTML_Tag_Processor::STATE_READY
	 * @see WP_HTML_Tag_Processor::STATE_COMPLETE
	 * @see WP_HTML_Tag_Processor::STATE_INCOMPLETE_INPUT
	 * @see WP_HTML_Tag_Processor::STATE_MATCHED_TAG
	 * @see WP_HTML_Tag_Processor::STATE_TEXT_NODE
	 * @see WP_HTML_Tag_Processor::STATE_CDATA_NODE
	 * @see WP_HTML_Tag_Processor::STATE_COMMENT
	 * @see WP_HTML_Tag_Processor::STATE_DOCTYPE
	 * @see WP_HTML_Tag_Processor::STATE_PRESUMPTUOUS_TAG
	 * @see WP_HTML_Tag_Processor::STATE_FUNKY_COMMENT
	 *
	 * @var string
	 */
	protected $parser_state = self::STATE_READY;

	/**
	 * Indicates if the document is in quirks mode or no-quirks mode.
	 *
	 * Impact on HTML parsing:
	 *
	 *  - In `NO_QUIRKS_MODE` (also known as "standard mode"):
	 *      - CSS class and ID selectors match byte-for-byte (case-sensitively).
	 *      - A TABLE start tag `<table>` implicitly closes any open `P` element.
	 *
	 *  - In `QUIRKS_MODE`:
	 *      - CSS class and ID selectors match in an ASCII case-insensitive manner.
	 *      - A TABLE start tag `<table>` opens a `TABLE` element as a child of a `P`
	 *        element if one is open.
	 *
	 * Quirks and no-quirks mode are thus mostly about styling, but have an impact when
	 * tables are found inside paragraph elements.
	 *
	 * @see self::QUIRKS_MODE
	 * @see self::NO_QUIRKS_MODE
	 *
	 * @since 6.7.0
	 *
	 * @var string
	 */
	protected $compat_mode = self::NO_QUIRKS_MODE;

	/**
	 * Indicates whether the parser is inside foreign content,
	 * e.g. inside an SVG or MathML element.
	 *
	 * One of 'html', 'svg', or 'math'.
	 *
	 * Several parsing rules change based on whether the parser
	 * is inside foreign content, including whether CDATA sections
	 * are allowed and whether a self-closing flag indicates that
	 * an element has no content.
	 *
	 * @since 6.7.0
	 *
	 * @var string
	 */
	private $parsing_namespace = 'html';

	/**
	 * What kind of syntax token became an HTML comment.
	 *
	 * Since there are many ways in which HTML syntax can create an HTML comment,
	 * this indicates which of those caused it. This allows the Tag Processor to
	 * represent more from the original input document than would appear in the DOM.
	 *
	 * @since 6.5.0
	 *
	 * @var string|null
	 */
	protected $comment_type = null;

	/**
	 * What kind of text the matched text node represents, if it was subdivided.
	 *
	 * @see self::TEXT_IS_NULL_SEQUENCE
	 * @see self::TEXT_IS_WHITESPACE
	 * @see self::TEXT_IS_GENERIC
	 * @see self::subdivide_text_appropriately
	 *
	 * @since 6.7.0
	 *
	 * @var string
	 */
	protected $text_node_classification = self::TEXT_IS_GENERIC;

	/**
	 * How many bytes from the original HTML document have been read and parsed.
	 *
	 * This value points to the latest byte offset in the input document which
	 * has been already parsed. It is the internal cursor for the Tag Processor
	 * and updates while scanning through the HTML tokens.
	 *
	 * @since 6.2.0
	 * @var int
	 */
	private $bytes_already_parsed = 0;

	/**
	 * Byte offset in input document where current token starts.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     01234
	 *     - token starts at 0
	 *
	 * @since 6.5.0
	 *
	 * @var int|null
	 */
	private $token_starts_at;

	/**
	 * Byte length of current token.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     012345678901234
	 *     - token spans bytes 0 through 14, so its length is 15
	 *
	 *     a <!-- comment --> is a token.
	 *     0123456789 123456789 123456789
	 *     - token spans bytes 2 through 17, so its length is 16
	 *
	 * @since 6.5.0
	 *
	 * @var int|null
	 */
	private $token_length;

	/**
	 * Byte offset in input document where current tag name starts.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     01234
	 *      - tag name starts at 1
	 *
	 * @since 6.2.0
	 *
	 * @var int|null
	 */
	private $tag_name_starts_at;

	/**
	 * Byte length of current tag name.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     01234
	 *      --- tag name length is 3
	 *
	 * @since 6.2.0
	 *
	 * @var int|null
	 */
	private $tag_name_length;

	/**
	 * Byte offset into input document where current modifiable text starts.
	 *
	 * @since 6.5.0
	 *
	 * @var int
	 */
	private $text_starts_at;

	/**
	 * Byte length of modifiable text.
	 *
	 * @since 6.5.0
	 *
	 * @var int
	 */
	private $text_length;

	/**
	 * Whether the current tag is an opening tag, e.g. <div>, or a closing tag, e.g. </div>.
	 *
	 * @var bool
	 */
	private $is_closing_tag;

	/**
	 * Lazily-built index of attributes found within an HTML tag, keyed by the attribute name.
	 *
	 * Example:
	 *
	 *     // Supposing the parser is working through this content
	 *     // and stops after recognizing the `id` attribute.
	 *     // <div id="test-4" class=outline title="data:text/plain;base64=asdk3nk1j3fo8">
	 *     //                 ^ parsing will continue from this point.
	 *     $this->attributes = array(
	 *         'id' => new WP_HTML_Attribute_Token( 'id', 9, 6, 5, 11, false )
	 *     );
	 *
	 *     // When picking up parsing again, or when asking to find the
	 *     // `class` attribute we will continue and add to this array.
	 *     $this->attributes = array(
	 *         'id'    => new WP_HTML_Attribute_Token( 'id', 9, 6, 5, 11, false ),
	 *         'class' => new WP_HTML_Attribute_Token( 'class', 23, 7, 17, 13, false )
	 *     );
	 *
	 *     // Note that only the `class` attribute value is stored in the index.
	 *     // That's because it is the only value used by this class at the moment.
	 *
	 * @since 6.2.0
	 * @var WP_HTML_Attribute_Token[]
	 */
	private $attributes = array();

	/**
	 * Tracks spans of duplicate attributes on a given tag, used for removing
	 * all copies of an attribute when calling `remove_attribute()`.
	 *
	 * @since 6.3.2
	 *
	 * @var (WP_HTML_Span[])[]|null
	 */
	private $duplicate_attributes = null;

	/**
	 * Which class names to add or remove from a tag.
	 *
	 * These are tracked separately from attribute updates because they are
	 * semantically distinct, whereas this interface exists for the common
	 * case of adding and removing class names while other attributes are
	 * generally modified as with DOM `setAttribute` calls.
	 *
	 * When modifying an HTML document these will eventually be collapsed
	 * into a single `set_attribute( 'class', $changes )` call.
	 *
	 * Example:
	 *
	 *     // Add the `wp-block-group` class, remove the `wp-group` class.
	 *     $classname_updates = array(
	 *         // Indexed by a comparable class name.
	 *         'wp-block-group' => WP_HTML_Tag_Processor::ADD_CLASS,
	 *         'wp-group'       => WP_HTML_Tag_Processor::REMOVE_CLASS
	 *     );
	 *
	 * @since 6.2.0
	 * @var bool[]
	 */
	private $classname_updates = array();

	/**
	 * Tracks a semantic location in the original HTML which
	 * shifts with updates as they are applied to the document.
	 *
	 * @since 6.2.0
	 * @var WP_HTML_Span[]
	 */
	protected $bookmarks = array();

	const ADD_CLASS    = true;
	const REMOVE_CLASS = false;
	const SKIP_CLASS   = null;

	/**
	 * Lexical replacements to apply to input HTML document.
	 *
	 * "Lexical" in this class refers to the part of this class which
	 * operates on pure text _as text_ and not as HTML. There's a line
	 * between the public interface, with HTML-semantic methods like
	 * `set_attribute` and `add_class`, and an internal state that tracks
	 * text offsets in the input document.
	 *
	 * When higher-level HTML methods are called, those have to transform their
	 * operations (such as setting an attribute's value) into text diffing
	 * operations (such as replacing the sub-string from indices A to B with
	 * some given new string). These text-diffing operations are the lexical
	 * updates.
	 *
	 * As new higher-level methods are added they need to collapse their
	 * operations into these lower-level lexical updates since that's the
	 * Tag Processor's internal language of change. Any code which creates
	 * these lexical updates must ensure that they do not cross HTML syntax
	 * boundaries, however, so these should never be exposed outside of this
	 * class or any classes which intentionally expand its functionality.
	 *
	 * These are enqueued while editing the document instead of being immediately
	 * applied to avoid processing overhead, string allocations, and string
	 * copies when applying many updates to a single document.
	 *
	 * Example:
	 *
	 *     // Replace an attribute stored with a new value, indices
	 *     // sourced from the lazily-parsed HTML recognizer.
	 *     $start  = $attributes['src']->start;
	 *     $length = $attributes['src']->length;
	 *     $modifications[] = new WP_HTML_Text_Replacement( $start, $length, $new_value );
	 *
	 *     // Correspondingly, something like this will appear in this array.
	 *     $lexical_updates = array(
	 *         new WP_HTML_Text_Replacement( 14, 28, 'https://my-site.my-domain/wp-content/uploads/2014/08/kittens.jpg' )
	 *     );
	 *
	 * @since 6.2.0
	 * @var WP_HTML_Text_Replacement[]
	 */
	protected $lexical_updates = array();
804
805 /**
806 * Tracks and limits `seek()` calls to prevent accidental infinite loops.
807 *
808 * @since 6.2.0
809 * @var int
810 *
811 * @see WP_HTML_Tag_Processor::seek()
812 */
813 protected $seek_count = 0;
814
815 /**
816 * Byte offset at which the parser should skip over an immediately-following linefeed
817 * character, as is the case with LISTING, PRE, and TEXTAREA; `null` when there is none.
818 *
819 * > If the next token is a U+000A LINE FEED (LF) character token, then
820 * > ignore that token and move on to the next one. (Newlines at the start
821 * > of [these] elements are ignored as an authoring convenience.)
822 *
823 * @since 6.7.0
824 *
825 * @var int|null
826 */
827 private $skip_newline_at = null;
828
829 /**
830 * Constructor.
831 *
832 * @since 6.2.0
833 *
834 * @param string $html HTML to process.
835 */
836 public function __construct( $html ) {
837 if ( ! is_string( $html ) ) {
838 _doing_it_wrong(
839 __METHOD__,
840 __( 'The HTML parameter must be a string.' ),
841 '6.9.0'
842 );
843 $html = '';
844 }
845 $this->html = $html;
846 }
847
848 /**
849 * Switches parsing mode into a new namespace, such as when
850 * encountering an SVG tag and entering foreign content.
851 *
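 * Example (a minimal sketch; the SVG input string is assumed for illustration only):
 *
 *     $processor = new WP_HTML_Tag_Processor( '<svg><title>Icon</title></svg>' );
 *     $processor->next_tag( 'svg' );
 *
 *     // Accepted: 'svg' is one of the three recognized namespaces.
 *     true === $processor->change_parsing_namespace( 'svg' );
 *
 *     // Rejected: any other value leaves the current namespace untouched.
 *     false === $processor->change_parsing_namespace( 'xml' );
 *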
852 * @since 6.7.0
853 *
854 * @param string $new_namespace One of 'html', 'svg', or 'math' indicating into what
855 * namespace the next tokens will be processed.
856 * @return bool Whether the namespace was valid and changed.
857 */
858 public function change_parsing_namespace( string $new_namespace ): bool {
859 if ( ! in_array( $new_namespace, array( 'html', 'math', 'svg' ), true ) ) {
860 return false;
861 }
862
863 $this->parsing_namespace = $new_namespace;
864 return true;
865 }
866
867 /**
868 * Finds the next tag matching the $query.
869 *
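 * Example (a minimal sketch; `$html` and the `wp-block-image` class name are
 * hypothetical and stand in for whatever document and class are being searched):
 *
 *     $processor = new WP_HTML_Tag_Processor( $html );
 *     $query     = array(
 *         'tag_name'     => 'img',
 *         'class_name'   => 'wp-block-image',
 *         'match_offset' => 2,
 *     );
 *
 *     // Stops on the second IMG tag carrying the class, if the document has one.
 *     if ( $processor->next_tag( $query ) ) {
 *         $processor->set_attribute( 'loading', 'lazy' );
 *     }
 *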
870 * @since 6.2.0
871 * @since 6.5.0 No longer processes incomplete tokens at end of document; pauses the processor at start of token.
872 *
873 * @param array|string|null $query {
874 * Optional. Which tag name to find, having which class, etc. Default is to find any tag.
875 *
876 * @type string|null $tag_name Which tag to find, or `null` for "any tag."
877 * @type int|null $match_offset Find the Nth tag matching all search criteria.
878 * 1 for "first" tag, 3 for "third," etc.
879 * Defaults to first tag.
880 * @type string|null $class_name Tag must contain this whole class name to match.
881 * @type string|null $tag_closers "visit" or "skip": whether to stop on tag closers, e.g. </div>.
882 * }
883 * @return bool Whether a tag was matched.
884 */
885 public function next_tag( $query = null ): bool {
886 $this->parse_query( $query );
887 $already_found = 0;
888
889 do {
890 if ( false === $this->next_token() ) {
891 return false;
892 }
893
894 if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
895 continue;
896 }
897
898 if ( $this->matches() ) {
899 ++$already_found;
900 }
901 } while ( $already_found < $this->sought_match_offset );
902
903 return true;
904 }
905
906 /**
907 * Finds the next token in the HTML document.
908 *
909 * An HTML document can be viewed as a stream of tokens,
910 * where tokens are things like HTML tags, HTML comments,
911 * text nodes, etc. This method finds the next token in
912 * the HTML document and returns whether it found one.
913 *
914 * If it starts parsing a token and reaches the end of the
915 * document then it will seek to the start of the last
916 * token and pause, returning `false` to indicate that it
917 * failed to find a complete token.
918 *
919 * Possible token types, based on the HTML specification:
920 *
921 * - an HTML tag, whether opening, closing, or void.
922 * - a text node - the plaintext inside tags.
923 * - an HTML comment.
924 * - a DOCTYPE declaration.
925 * - a processing instruction, e.g. `<?xml version="1.0" ?>`.
926 *
927 * `next_tag()` stops only on tags, but this method also reports other tokens, such as text nodes and comments.
928 *
929 * @since 6.5.0
930 * @since 6.7.0 Recognizes CDATA sections within foreign content.
931 *
932 * @return bool Whether a token was parsed.
933 */
934 public function next_token(): bool {
935 return $this->base_class_next_token();
936 }
937
938 /**
939 * Internal method which finds the next token in the HTML document.
940 *
941 * This method is a private internal function which implements the logic for
942 * finding the next token in a document. It exists so that the parser can update
943 * its state without affecting the location of the cursor in the document and
944 * without triggering subclass methods for things like `next_token()`, e.g. when
945 * applying patches before searching for the next token.
946 *
947 * @since 6.5.0
948 *
949 * @access private
950 *
951 * @return bool Whether a token was parsed.
952 */
953 private function base_class_next_token(): bool {
954 $was_at = $this->bytes_already_parsed;
955 $this->after_tag();
956
957 // Don't proceed if there's nothing more to scan.
958 if (
959 self::STATE_COMPLETE === $this->parser_state ||
960 self::STATE_INCOMPLETE_INPUT === $this->parser_state
961 ) {
962 return false;
963 }
964
965 /*
966 * The next step in the parsing loop determines the parsing state;
967 * clear it so that state doesn't linger from the previous step.
968 */
969 $this->parser_state = self::STATE_READY;
970
971 if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
972 $this->parser_state = self::STATE_COMPLETE;
973 return false;
974 }
975
976 // Find the next tag if it exists.
977 if ( false === $this->parse_next_tag() ) {
978 if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) {
979 $this->bytes_already_parsed = $was_at;
980 }
981
982 return false;
983 }
984
985 /*
986 * For legacy reasons the rest of this function handles tags and their
987 * attributes. If the processor has reached the end of the document
988 * or if it matched any other token then it should return here to avoid
989 * attempting to process tag-specific syntax.
990 */
991 if (
992 self::STATE_INCOMPLETE_INPUT !== $this->parser_state &&
993 self::STATE_COMPLETE !== $this->parser_state &&
994 self::STATE_MATCHED_TAG !== $this->parser_state
995 ) {
996 return true;
997 }
998
999 // Parse all of its attributes.
1000 while ( $this->parse_next_attribute() ) {
1001 continue;
1002 }
1003
1004 // Ensure that the tag closes before the end of the document.
1005 if (
1006 self::STATE_INCOMPLETE_INPUT === $this->parser_state ||
1007 $this->bytes_already_parsed >= strlen( $this->html )
1008 ) {
1009 // Does this appropriately clear state (parsed attributes)?
1010 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1011 $this->bytes_already_parsed = $was_at;
1012
1013 return false;
1014 }
1015
1016 $tag_ends_at = strpos( $this->html, '>', $this->bytes_already_parsed );
1017 if ( false === $tag_ends_at ) {
1018 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1019 $this->bytes_already_parsed = $was_at;
1020
1021 return false;
1022 }
1023 $this->parser_state = self::STATE_MATCHED_TAG;
1024 $this->bytes_already_parsed = $tag_ends_at + 1;
1025 $this->token_length = $this->bytes_already_parsed - $this->token_starts_at;
1026
1027 /*
1028 * Certain tags require additional processing. The first-letter pre-check
1029 * avoids unnecessary string allocation when comparing the tag names.
1030 *
1031 * - IFRAME
1032 * - LISTING (deprecated)
1033 * - NOEMBED (deprecated)
1034 * - NOFRAMES (deprecated)
1035 * - PRE
1036 * - SCRIPT
1037 * - STYLE
1038 * - TEXTAREA
1039 * - TITLE
1040 * - XMP (deprecated)
1041 */
1042 if (
1043 $this->is_closing_tag ||
1044 'html' !== $this->parsing_namespace ||
1045 1 !== strspn( $this->html, 'iIlLnNpPsStTxX', $this->tag_name_starts_at, 1 )
1046 ) {
1047 return true;
1048 }
1049
1050 $tag_name = $this->get_tag();
1051
1052 /*
1053 * For LISTING, PRE, and TEXTAREA, the first linefeed of an immediately-following
1054 * text node is ignored as an authoring convenience.
1055 *
1056 * @see static::skip_newline_at
1057 */
1058 if ( 'LISTING' === $tag_name || 'PRE' === $tag_name ) {
1059 $this->skip_newline_at = $this->bytes_already_parsed;
1060 return true;
1061 }
1062
1063 /*
1064 * There are certain elements whose children are not DATA but are instead
1065 * RCDATA or RAWTEXT. These cannot contain other elements, and the contents
1066 * are parsed as plaintext, with character references decoded in RCDATA but
1067 * not in RAWTEXT.
1068 *
1069 * These elements are described here as "self-contained" or special atomic
1070 * elements whose end tag is consumed with the opening tag, and they will
1071 * contain modifiable text inside of them.
1072 *
1073 * Preserve the opening tag pointers, as these will be overwritten
1074 * when finding the closing tag. They will be reset after finding
1075 * the closing tag to point to the opening of the special atomic
1076 * tag sequence.
1077 */
1078 $tag_name_starts_at = $this->tag_name_starts_at;
1079 $tag_name_length = $this->tag_name_length;
1080 $tag_ends_at = $this->token_starts_at + $this->token_length;
1081 $attributes = $this->attributes;
1082 $duplicate_attributes = $this->duplicate_attributes;
1083
1084 // Find the closing tag if necessary.
1085 switch ( $tag_name ) {
1086 case 'SCRIPT':
1087 $found_closer = $this->skip_script_data();
1088 break;
1089
1090 case 'TEXTAREA':
1091 case 'TITLE':
1092 $found_closer = $this->skip_rcdata( $tag_name );
1093 break;
1094
1095 /*
1096 * In the browser this list would include the NOSCRIPT element,
1097 * but the Tag Processor is an environment with the scripting
1098 * flag disabled, meaning that it needs to descend into the
1099 * NOSCRIPT element to be able to properly process what will be
1100 * sent to a browser.
1101 *
1102 * Note that this rule makes HTML5 syntax incompatible with XML,
1103 * because the parsing of this token depends on the client application.
1104 * The NOSCRIPT element cannot be represented in the XHTML syntax.
1105 */
1106 case 'IFRAME':
1107 case 'NOEMBED':
1108 case 'NOFRAMES':
1109 case 'STYLE':
1110 case 'XMP':
1111 $found_closer = $this->skip_rawtext( $tag_name );
1112 break;
1113
1114 // No other tags should be treated in their entirety here.
1115 default:
1116 return true;
1117 }
1118
1119 if ( ! $found_closer ) {
1120 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1121 $this->bytes_already_parsed = $was_at;
1122 return false;
1123 }
1124
1125 /*
1126 * The values here look like they reference the opening tag but they reference
1127 * the closing tag instead. This is why the opening tag values were stored
1128 * above in a variable. It reads confusingly here, but that's because the
1129 * functions that skip the contents have moved all the internal cursors past
1130 * the inner content of the tag.
1131 */
1132 $this->token_starts_at = $was_at;
1133 $this->token_length = $this->bytes_already_parsed - $this->token_starts_at;
1134 $this->text_starts_at = $tag_ends_at;
1135 $this->text_length = $this->tag_name_starts_at - $this->text_starts_at;
1136 $this->tag_name_starts_at = $tag_name_starts_at;
1137 $this->tag_name_length = $tag_name_length;
1138 $this->attributes = $attributes;
1139 $this->duplicate_attributes = $duplicate_attributes;
1140
1141 return true;
1142 }
1143
1144 /**
1145 * Whether the processor paused because the input HTML document ended
1146 * in the middle of a syntax element, such as in the middle of a tag.
1147 *
1148 * Example:
1149 *
1150 * $processor = new WP_HTML_Tag_Processor( '<input type="text" value="Th' );
1151 * false === $processor->next_tag();
1152 * true === $processor->paused_at_incomplete_token();
1153 *
1154 * @since 6.5.0
1155 *
1156 * @return bool Whether the parse paused at the start of an incomplete token.
1157 */
1158 public function paused_at_incomplete_token(): bool {
1159 return self::STATE_INCOMPLETE_INPUT === $this->parser_state;
1160 }
1161
1162 /**
1163 * Generator for a foreach loop to step through each class name for the matched tag.
1164 *
1165 * This generator function is designed to be used inside a "foreach" loop.
1166 *
1167 * Example:
1168 *
1169 * $p = new WP_HTML_Tag_Processor( "<div class='free &lt;egg&gt;\tlang-en'>" );
1170 * $p->next_tag();
1171 * foreach ( $p->class_list() as $class_name ) {
1172 * echo "{$class_name} ";
1173 * }
1174 * // Outputs: "free <egg> lang-en "
1175 *
1176 * @since 6.4.0
1177 */
1178 public function class_list() {
1179 if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
1180 return;
1181 }
1182
1183 /** @var string $class contains the string value of the class attribute, with character references decoded. */
1184 $class = $this->get_attribute( 'class' );
1185
1186 if ( ! is_string( $class ) ) {
1187 return;
1188 }
1189
1190 $seen = array();
1191
1192 $is_quirks = self::QUIRKS_MODE === $this->compat_mode;
1193
1194 $at = 0;
1195 while ( $at < strlen( $class ) ) {
1196 // Skip past any initial boundary characters.
1197 $at += strspn( $class, " \t\f\r\n", $at );
1198 if ( $at >= strlen( $class ) ) {
1199 return;
1200 }
1201
1202 // Find the byte length until the next boundary.
1203 $length = strcspn( $class, " \t\f\r\n", $at );
1204 if ( 0 === $length ) {
1205 return;
1206 }
1207
1208 $name = str_replace( "\x00", "\u{FFFD}", substr( $class, $at, $length ) );
1209 if ( $is_quirks ) {
1210 $name = strtolower( $name );
1211 }
1212 $at += $length;
1213
1214 /*
1215 * It's expected that the number of class names for a given tag is relatively small.
1216 * Given this, it is probably faster overall to scan an array for a value rather
1217 * than to use the class name as a key and check if it's a key of $seen.
1218 */
1219 if ( in_array( $name, $seen, true ) ) {
1220 continue;
1221 }
1222
1223 $seen[] = $name;
1224 yield $name;
1225 }
1226 }
1227
1228
1229 /**
1230 * Returns whether the matched tag contains the given class name, compared ASCII case-insensitively in quirks mode.
1231 *
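 * Example (a minimal sketch; the input markup and class names are assumed for illustration):
 *
 *     $processor = new WP_HTML_Tag_Processor( '<div class="wide hero">' );
 *
 *     // No tag is matched before the first successful `next_tag()` call.
 *     null === $processor->has_class( 'wide' );
 *
 *     $processor->next_tag();
 *     true === $processor->has_class( 'wide' );
 *     false === $processor->has_class( 'missing' );
 *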
1232 * @since 6.4.0
1233 *
1234 * @param string $wanted_class Look for this CSS class name; matching is ASCII case-insensitive only in quirks mode.
1235 * @return bool|null Whether the matched tag contains the given class name, or null if not matched.
1236 */
1237 public function has_class( $wanted_class ): ?bool {
1238 if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
1239 return null;
1240 }
1241
1242 $case_insensitive = self::QUIRKS_MODE === $this->compat_mode;
1243
1244 $wanted_length = strlen( $wanted_class );
1245 foreach ( $this->class_list() as $class_name ) {
1246 if (
1247 strlen( $class_name ) === $wanted_length &&
1248 0 === substr_compare( $class_name, $wanted_class, 0, $wanted_length, $case_insensitive )
1249 ) {
1250 return true;
1251 }
1252 }
1253
1254 return false;
1255 }
1256
1257
1258 /**
1259 * Sets a bookmark in the HTML document.
1260 *
1261 * Bookmarks represent specific places or tokens in the HTML
1262 * document, such as a tag opener or closer. When applying
1263 * edits to a document, such as setting an attribute, the
1264 * text offsets of that token may shift; the bookmark is
1265 * kept updated with those shifts and remains stable unless
1266 * the entire span of text in which the token sits is removed.
1267 *
1268 * Release bookmarks when they are no longer needed.
1269 *
1270 * Example:
1271 *
1272 *     <main><h2>Surprising fact you may not know!</h2></main>
1273 *           ^  ^
1274 *            \-|-- this `H2` opener bookmark tracks the token
1275 *
1276 *     <main class="clickbait"><h2>Surprising fact you may no…
1277 *                             ^  ^
1278 *                              \-|-- it shifts with edits
1279 *
1280 * Bookmarks provide the ability to seek to a previously-scanned
1281 * place in the HTML document. This avoids the need to re-scan
1282 * the entire document.
1283 *
1284 * Example:
1285 *
1286 *     <ul><li>One</li><li>Two</li><li>Three</li></ul>
1287 *                                 ^^^^
1288 *                                 want to note this last item
1289 *
1290 * $p = new WP_HTML_Tag_Processor( $html );
1291 * $in_list = false;
1292 * while ( $p->next_tag( array( 'tag_closers' => $in_list ? 'visit' : 'skip' ) ) ) {
1293 * if ( 'UL' === $p->get_tag() ) {
1294 * if ( $p->is_tag_closer() ) {
1295 * $in_list = false;
1296 * $p->set_bookmark( 'resume' );
1297 * if ( $p->seek( 'last-li' ) ) {
1298 * $p->add_class( 'last-li' );
1299 * }
1300 * $p->seek( 'resume' );
1301 * $p->release_bookmark( 'last-li' );
1302 * $p->release_bookmark( 'resume' );
1303 * } else {
1304 * $in_list = true;
1305 * }
1306 * }
1307 *
1308 * if ( 'LI' === $p->get_tag() ) {
1309 * $p->set_bookmark( 'last-li' );
1310 * }
1311 * }
1312 *
1313 * Bookmarks intentionally hide the internal string offsets
1314 * to which they refer. They are maintained internally as
1315 * updates are applied to the HTML document and therefore
1316 * retain their "position" - the location to which they
1317 * originally pointed. The inability to use bookmarks with
1318 * functions like `substr` is therefore intentional to guard
1319 * against accidentally breaking the HTML.
1320 *
1321 * Because bookmarks allocate memory and require processing
1322 * for every applied update, they are limited and require
1323 * a name. They should not be created with programmatically-made
1324 * names, such as "li_{$index}" with some loop. As a general
1325 * rule they should only be created with string-literal names
1326 * like "start-of-section" or "last-paragraph".
1327 *
1328 * Bookmarks are a powerful tool to enable complicated behavior.
1329 * Consider double-checking that you need this tool if you are
1330 * reaching for it, as inappropriate use could lead to broken
1331 * HTML structure or unwanted processing overhead.
1332 *
1333 * @since 6.2.0
1334 *
1335 * @param string $name Identifies this particular bookmark.
1336 * @return bool Whether the bookmark was successfully created.
1337 */
1338 public function set_bookmark( $name ): bool {
1339 // It only makes sense to set a bookmark if the parser has paused on a concrete token.
1340 if (
1341 self::STATE_COMPLETE === $this->parser_state ||
1342 self::STATE_INCOMPLETE_INPUT === $this->parser_state
1343 ) {
1344 return false;
1345 }
1346
1347 if ( ! array_key_exists( $name, $this->bookmarks ) && count( $this->bookmarks ) >= static::MAX_BOOKMARKS ) {
1348 _doing_it_wrong(
1349 __METHOD__,
1350 __( 'Too many bookmarks: cannot create any more.' ),
1351 '6.2.0'
1352 );
1353 return false;
1354 }
1355
1356 $this->bookmarks[ $name ] = new WP_HTML_Span( $this->token_starts_at, $this->token_length );
1357
1358 return true;
1359 }
1360
1361
1362 /**
1363 * Removes a bookmark that is no longer needed.
1364 *
1365 * Releasing a bookmark frees up the small
1366 * performance overhead it requires.
1367 *
1368 * @param string $name Name of the bookmark to remove.
1369 * @return bool Whether the bookmark already existed before removal.
1370 */
1371 public function release_bookmark( $name ): bool {
1372 if ( ! array_key_exists( $name, $this->bookmarks ) ) {
1373 return false;
1374 }
1375
1376 unset( $this->bookmarks[ $name ] );
1377
1378 return true;
1379 }
1380
1381 /**
1382 * Skips contents of generic rawtext elements.
1383 *
1384 * @since 6.3.2
1385 *
1386 * @see https://html.spec.whatwg.org/#generic-raw-text-element-parsing-algorithm
1387 *
1388 * @param string $tag_name The uppercase tag name which will close the RAWTEXT region.
1389 * @return bool Whether an end to the RAWTEXT region was found before the end of the document.
1390 */
1391 private function skip_rawtext( string $tag_name ): bool {
1392 /*
1393 * These two functions distinguish themselves on whether character references are
1394 * decoded, and since functionality to read the inner markup isn't supported, it's
1395 * not necessary to implement these two functions separately.
1396 */
1397 return $this->skip_rcdata( $tag_name );
1398 }
1399
1400 /**
1401 * Skips contents of RCDATA elements, namely title and textarea tags.
1402 *
1403 * @since 6.2.0
1404 *
1405 * @see https://html.spec.whatwg.org/multipage/parsing.html#rcdata-state
1406 *
1407 * @param string $tag_name The uppercase tag name which will close the RCDATA region.
1408 * @return bool Whether an end to the RCDATA region was found before the end of the document.
1409 */
1410 private function skip_rcdata( string $tag_name ): bool {
1411 $html = $this->html;
1412 $doc_length = strlen( $html );
1413 $tag_length = strlen( $tag_name );
1414
1415 $at = $this->bytes_already_parsed;
1416
1417 while ( false !== $at && $at < $doc_length ) {
1418 $at = strpos( $this->html, '</', $at );
1419 $this->tag_name_starts_at = $at;
1420
1421 // Fail if there is no possible tag closer.
1422 if ( false === $at || ( $at + $tag_length ) >= $doc_length ) {
1423 return false;
1424 }
1425
1426 $at += 2;
1427
1428 /*
1429 * Find a case-insensitive match to the tag name.
1430 *
1431 * Because tag names are limited to US-ASCII there is no
1432 * need to perform any kind of Unicode normalization when
1433 * comparing; any character which could be impacted by such
1434 * normalization could not be part of a tag name.
1435 */
1436 for ( $i = 0; $i < $tag_length; $i++ ) {
1437 $tag_char = $tag_name[ $i ];
1438 $html_char = $html[ $at + $i ];
1439
1440 if ( $html_char !== $tag_char && strtoupper( $html_char ) !== $tag_char ) {
1441 $at += $i;
1442 continue 2;
1443 }
1444 }
1445
1446 $at += $tag_length;
1447 $this->bytes_already_parsed = $at;
1448
1449 if ( $at >= strlen( $html ) ) {
1450 return false;
1451 }
1452
1453 /*
1454 * Ensure that the tag name terminates to avoid matching on
1455 * substrings of a longer tag name. For example, the sequence
1456 * "</textarearug" should not match for "</textarea" even
1457 * though "textarea" is found within the text.
1458 */
1459 $c = $html[ $at ];
1460 if ( ' ' !== $c && "\t" !== $c && "\r" !== $c && "\n" !== $c && '/' !== $c && '>' !== $c ) {
1461 continue;
1462 }
1463
1464 while ( $this->parse_next_attribute() ) {
1465 continue;
1466 }
1467
1468 $at = $this->bytes_already_parsed;
1469 if ( $at >= strlen( $this->html ) ) {
1470 return false;
1471 }
1472
1473 if ( '>' === $html[ $at ] ) {
1474 $this->bytes_already_parsed = $at + 1;
1475 return true;
1476 }
1477
1478 if ( $at + 1 >= strlen( $this->html ) ) {
1479 return false;
1480 }
1481
1482 if ( '/' === $html[ $at ] && '>' === $html[ $at + 1 ] ) {
1483 $this->bytes_already_parsed = $at + 2;
1484 return true;
1485 }
1486 }
1487
1488 return false;
1489 }
1490
1491 /**
1492 * Skips contents of script tags.
1493 *
1494 * @since 6.2.0
1495 *
1496 * @return bool Whether the script tag was closed before the end of the document.
1497 */
1498 private function skip_script_data(): bool {
1499 $state = 'unescaped';
1500 $html = $this->html;
1501 $doc_length = strlen( $html );
1502 $at = $this->bytes_already_parsed;
1503
1504 while ( false !== $at && $at < $doc_length ) {
1505 $at += strcspn( $html, '-<', $at );
1506
1507 /*
1508 * Optimization: Terminating a complete script element requires at least eight
1509 * additional bytes in the document. Some checks below may cause local escaped
1510 * state transitions when processing shorter strings, but those transitions are
1511 * irrelevant if the script tag is incomplete and the function must return false.
1512 *
1513 * This may need updating if those transitions become significant or exported from
1514 * this function in some way, such as when building safe methods to embed JavaScript
1515 * or data inside a SCRIPT element.
1516 *
1517 *     $at may be here.
1518 *        ↓
1519 *     ...</script>
1520 *         ╰──┬───╯
1521 *     $at + 8 additional bytes are required for a non-false return value.
1522 *
1523 * This single check eliminates the need to check lengths for the shorter spans:
1524 *
1525 *     $at may be here.
1526 *                  ↓
1527 *     <script><!-- --></script>
1528 *                   ├╯
1529 *     $at + 2 additional characters does not require a length check.
1530 *
1531 * The transition from "escaped" to "unescaped" is not relevant if the document ends:
1532 *
1533 *     $at may be here.
1534 *                  ↓
1535 *     <script><!-- -->[[END-OF-DOCUMENT]]
1536 *                   ╰──┬───╯
1537 *     $at + 8 additional bytes is not satisfied, return false.
1538 */
1539 if ( $at + 8 >= $doc_length ) {
1540 return false;
1541 }
1542
1543 /*
1544 * For all script states a "-->" transitions
1545 * back into the normal unescaped script mode,
1546 * even if that's the current state.
1547 */
1548 if (
1549 '-' === $html[ $at ] &&
1550 '-' === $html[ $at + 1 ] &&
1551 '>' === $html[ $at + 2 ]
1552 ) {
1553 $at += 3;
1554 $state = 'unescaped';
1555 continue;
1556 }
1557
1558 /*
1559 * Everything of interest past here starts with "<".
1560 * Check this character and advance position regardless.
1561 */
1562 if ( '<' !== $html[ $at++ ] ) {
1563 continue;
1564 }
1565
1566 /*
1567 * "<!--" only transitions from _unescaped_ to _escaped_. This byte sequence is only
1568 * significant in the _unescaped_ state and is ignored in any other state.
1569 */
1570 if (
1571 'unescaped' === $state &&
1572 '!' === $html[ $at ] &&
1573 '-' === $html[ $at + 1 ] &&
1574 '-' === $html[ $at + 2 ]
1575 ) {
1576 $at += 3;
1577
1578 /*
1579 * The parser is ready to enter the _escaped_ state, but may remain in the
1580 * _unescaped_ state. This occurs when "<!--" is immediately followed by a
1581 * sequence of 0 or more "-" followed by ">". This is similar to abruptly closed
1582 * HTML comments like "<!-->" or "<!--->".
1583 *
1584 * Note that this check may advance the position significantly and requires a
1585 * length check to prevent bad offsets on inputs like `<script><!---------`.
1586 */
1587 $at += strspn( $html, '-', $at );
1588 if ( $at < $doc_length && '>' === $html[ $at ] ) {
1589 ++$at;
1590 continue;
1591 }
1592
1593 $state = 'escaped';
1594 continue;
1595 }
1596
1597 if ( '/' === $html[ $at ] ) {
1598 $closer_potentially_starts_at = $at - 1;
1599 $is_closing = true;
1600 ++$at;
1601 } else {
1602 $is_closing = false;
1603 }
1604
1605 /*
1606 * At this point the only remaining state-changes occur with the
1607 * <script> and </script> tags; unless one of these appears next,
1608 * proceed scanning to the next potential token in the text.
1609 */
1610 if ( ! (
1611 ( 's' === $html[ $at ] || 'S' === $html[ $at ] ) &&
1612 ( 'c' === $html[ $at + 1 ] || 'C' === $html[ $at + 1 ] ) &&
1613 ( 'r' === $html[ $at + 2 ] || 'R' === $html[ $at + 2 ] ) &&
1614 ( 'i' === $html[ $at + 3 ] || 'I' === $html[ $at + 3 ] ) &&
1615 ( 'p' === $html[ $at + 4 ] || 'P' === $html[ $at + 4 ] ) &&
1616 ( 't' === $html[ $at + 5 ] || 'T' === $html[ $at + 5 ] )
1617 ) ) {
1618 ++$at;
1619 continue;
1620 }
1621
1622 /*
1623 * Ensure that the script tag terminates to avoid matching on
1624 * substrings of a non-match. For example, the sequence
1625 * "<script123" should not end a script region even though
1626 * "<script" is found within the text.
1627 */
1628 $at += 6;
1629 $c = $html[ $at ];
1630 if (
1631 /**
1632 * These characters trigger state transitions of interest:
1633 *
1634 * - @see {https://html.spec.whatwg.org/multipage/parsing.html#script-data-end-tag-name-state}
1635 * - @see {https://html.spec.whatwg.org/multipage/parsing.html#script-data-escaped-end-tag-name-state}
1636 * - @see {https://html.spec.whatwg.org/multipage/parsing.html#script-data-double-escape-start-state}
1637 * - @see {https://html.spec.whatwg.org/multipage/parsing.html#script-data-double-escape-end-state}
1638 *
1639 * The "\r" character is not present in the above references. However, "\r" must be
1640 * treated the same as "\n". This is because the HTML Standard requires newline
1641 * normalization during preprocessing which applies this replacement.
1642 *
1643 * - @see https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream
1644 * - @see https://infra.spec.whatwg.org/#normalize-newlines
1645 */
1646 '>' !== $c &&
1647 ' ' !== $c &&
1648 "\n" !== $c &&
1649 '/' !== $c &&
1650 "\t" !== $c &&
1651 "\f" !== $c &&
1652 "\r" !== $c
1653 ) {
1654 continue;
1655 }
1656
1657 if ( 'escaped' === $state && ! $is_closing ) {
1658 $state = 'double-escaped';
1659 continue;
1660 }
1661
1662 if ( 'double-escaped' === $state && $is_closing ) {
1663 $state = 'escaped';
1664 continue;
1665 }
1666
1667 if ( $is_closing ) {
1668 $this->bytes_already_parsed = $closer_potentially_starts_at;
1669 $this->tag_name_starts_at = $closer_potentially_starts_at;
1670 if ( $this->bytes_already_parsed >= $doc_length ) {
1671 return false;
1672 }
1673
1674 while ( $this->parse_next_attribute() ) {
1675 continue;
1676 }
1677
1678 if ( $this->bytes_already_parsed >= $doc_length ) {
1679 return false;
1680 }
1681
1682 if ( '>' === $html[ $this->bytes_already_parsed ] ) {
1683 ++$this->bytes_already_parsed;
1684 return true;
1685 }
1686 }
1687
1688 ++$at;
1689 }
1690
1691 return false;
1692 }
1693
1694 /**
1695 * Parses the next tag.
1696 *
1697 * This will find and start parsing the next tag, including
1698 * the opening `<`, the potential closer `/`, and the tag
1699 * name. It does not parse the attributes or scan to the
1700 * closing `>`; these are left for other methods.
1701 *
1702 * @since 6.2.0
1703 * @since 6.2.1 Support abruptly-closed comments, invalid-tag-closer-comments, and empty elements.
1704 *
1705 * @return bool Whether a tag was found before the end of the document.
1706 */
1707 private function parse_next_tag(): bool {
1708 $this->after_tag();
1709
1710 $html = $this->html;
1711 $doc_length = strlen( $html );
1712 $was_at = $this->bytes_already_parsed;
1713 $at = $was_at;
1714
1715 while ( $at < $doc_length ) {
1716 $at = strpos( $html, '<', $at );
1717 if ( false === $at ) {
1718 break;
1719 }
1720
1721 if ( $at > $was_at ) {
1722 /*
1723 * A "<" normally starts a new HTML tag or syntax token, but in cases where the
1724 * following character can't produce a valid token, the "<" is instead treated
1725 * as plaintext and the parser should skip over it. This avoids a problem when
1726 * following earlier practices of typing emoji with text, e.g. "<3". This
1727 * should be a heart, not a tag. It's supposed to be rendered, not hidden.
1728 *
1729 * At this point the parser checks if this is one of those cases and if it is
1730 * will continue searching for the next "<" in search of a token boundary.
1731 *
1732 * @see https://html.spec.whatwg.org/#tag-open-state
1733 */
1734 if ( 1 !== strspn( $html, '!/?abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', $at + 1, 1 ) ) {
1735 ++$at;
1736 continue;
1737 }
1738
1739 $this->parser_state = self::STATE_TEXT_NODE;
1740 $this->token_starts_at = $was_at;
1741 $this->token_length = $at - $was_at;
1742 $this->text_starts_at = $was_at;
1743 $this->text_length = $this->token_length;
1744 $this->bytes_already_parsed = $at;
1745 return true;
1746 }
1747
1748 $this->token_starts_at = $at;
1749
1750 if ( $at + 1 < $doc_length && '/' === $this->html[ $at + 1 ] ) {
1751 $this->is_closing_tag = true;
1752 ++$at;
1753 } else {
1754 $this->is_closing_tag = false;
1755 }
1756
1757 /*
1758 * HTML tag names must start with [a-zA-Z] otherwise they are not tags.
1759 * For example, "<3" is rendered as text, not a tag opener. If at least
1760 * one letter follows the "<" then _it is_ a tag, but if the following
1761 * character is anything else it _is not a tag_.
1762 *
1763 * It's not uncommon to find non-tags starting with `<` in an HTML
1764 * document, so it's good for performance to make this pre-check before
1765 * continuing to attempt to parse a tag name.
1766 *
1767 * Reference:
1768 * * https://html.spec.whatwg.org/multipage/parsing.html#data-state
1769 * * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
1770 */
1771 $tag_name_prefix_length = strspn( $html, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', $at + 1 );
1772 if ( $tag_name_prefix_length > 0 ) {
1773 ++$at;
1774 $this->parser_state = self::STATE_MATCHED_TAG;
1775 $this->tag_name_starts_at = $at;
1776 $this->tag_name_length = $tag_name_prefix_length + strcspn( $html, " \t\f\r\n/>", $at + $tag_name_prefix_length );
1777 $this->bytes_already_parsed = $at + $this->tag_name_length;
1778 return true;
1779 }
1780
1781 /*
1782 * Abort if no tag is found before the end of
1783 * the document. There is nothing left to parse.
1784 */
1785 if ( $at + 1 >= $doc_length ) {
1786 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1787
1788 return false;
1789 }
1790
1791 /*
1792 * `<!` transitions to markup declaration open state
1793 * https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state
1794 */
1795 if ( ! $this->is_closing_tag && '!' === $html[ $at + 1 ] ) {
1796 /*
1797 * `<!--` transitions to a comment state – apply further comment rules.
1798 * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
1799 */
1800 if ( 0 === substr_compare( $html, '--', $at + 2, 2 ) ) {
1801 $closer_at = $at + 4;
1802 // If it's not possible to close the comment then there is nothing more to scan.
1803 if ( $doc_length <= $closer_at ) {
1804 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1805
1806 return false;
1807 }
1808
1809 // Abruptly-closed empty comments are a sequence of dashes followed by `>`.
1810 $span_of_dashes = strspn( $html, '-', $closer_at );
1811 if ( $closer_at + $span_of_dashes < $doc_length && '>' === $html[ $closer_at + $span_of_dashes ] ) {
1812 /*
1813 * @todo When implementing `set_modifiable_text()` ensure that updates to this token
1814 * don't break the syntax for short comments, e.g. `<!--->`. Unlike other comment
1815 * and bogus comment syntax, these leave no clear insertion point for text and
1816 * they need to be modified specially in order to contain text. E.g. to store
1817 * `?` as the modifiable text, the `<!--->` needs to become `<!--?-->`, which
1818 * involves inserting an additional `-` into the token after the modifiable text.
1819 */
1820 $this->parser_state = self::STATE_COMMENT;
1821 $this->comment_type = self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT;
1822 $this->token_length = $closer_at + $span_of_dashes + 1 - $this->token_starts_at;
1823
1824 // Only provide modifiable text if the token is long enough to contain it.
1825 if ( $span_of_dashes >= 2 ) {
1826 $this->comment_type = self::COMMENT_AS_HTML_COMMENT;
1827 $this->text_starts_at = $this->token_starts_at + 4;
1828 $this->text_length = $span_of_dashes - 2;
1829 }
1830
1831 $this->bytes_already_parsed = $closer_at + $span_of_dashes + 1;
1832 return true;
1833 }
1834
1835 /*
1836 * Comments may be closed by either a --> or an invalid --!>.
1837 * The first occurrence closes the comment.
1838 *
1839 * See https://html.spec.whatwg.org/#parse-error-incorrectly-closed-comment
1840 */
1841 --$closer_at; // Pre-increment inside condition below reduces risk of accidental infinite looping.
1842 while ( ++$closer_at < $doc_length ) {
1843 $closer_at = strpos( $html, '--', $closer_at );
1844 if ( false === $closer_at ) {
1845 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1846
1847 return false;
1848 }
1849
1850 if ( $closer_at + 2 < $doc_length && '>' === $html[ $closer_at + 2 ] ) {
1851 $this->parser_state = self::STATE_COMMENT;
1852 $this->comment_type = self::COMMENT_AS_HTML_COMMENT;
1853 $this->token_length = $closer_at + 3 - $this->token_starts_at;
1854 $this->text_starts_at = $this->token_starts_at + 4;
1855 $this->text_length = $closer_at - $this->text_starts_at;
1856 $this->bytes_already_parsed = $closer_at + 3;
1857 return true;
1858 }
1859
1860 if (
1861 $closer_at + 3 < $doc_length &&
1862 '!' === $html[ $closer_at + 2 ] &&
1863 '>' === $html[ $closer_at + 3 ]
1864 ) {
1865 $this->parser_state = self::STATE_COMMENT;
1866 $this->comment_type = self::COMMENT_AS_HTML_COMMENT;
1867 $this->token_length = $closer_at + 4 - $this->token_starts_at;
1868 $this->text_starts_at = $this->token_starts_at + 4;
1869 $this->text_length = $closer_at - $this->text_starts_at;
1870 $this->bytes_already_parsed = $closer_at + 4;
1871 return true;
1872 }
1873 }
1874 }
1875
1876 /*
1877 * `<!DOCTYPE` transitions to DOCTYPE state – skip to the nearest >
1878 * These are ASCII-case-insensitive.
1879 * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
1880 */
1881 if (
1882 $doc_length > $at + 8 &&
1883 ( 'D' === $html[ $at + 2 ] || 'd' === $html[ $at + 2 ] ) &&
1884 ( 'O' === $html[ $at + 3 ] || 'o' === $html[ $at + 3 ] ) &&
1885 ( 'C' === $html[ $at + 4 ] || 'c' === $html[ $at + 4 ] ) &&
1886 ( 'T' === $html[ $at + 5 ] || 't' === $html[ $at + 5 ] ) &&
1887 ( 'Y' === $html[ $at + 6 ] || 'y' === $html[ $at + 6 ] ) &&
1888 ( 'P' === $html[ $at + 7 ] || 'p' === $html[ $at + 7 ] ) &&
1889 ( 'E' === $html[ $at + 8 ] || 'e' === $html[ $at + 8 ] )
1890 ) {
1891 $closer_at = strpos( $html, '>', $at + 9 );
1892 if ( false === $closer_at ) {
1893 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1894
1895 return false;
1896 }
1897
1898 $this->parser_state = self::STATE_DOCTYPE;
1899 $this->token_length = $closer_at + 1 - $this->token_starts_at;
1900 $this->text_starts_at = $this->token_starts_at + 9;
1901 $this->text_length = $closer_at - $this->text_starts_at;
1902 $this->bytes_already_parsed = $closer_at + 1;
1903 return true;
1904 }
1905
1906 if (
1907 'html' !== $this->parsing_namespace &&
1908 strlen( $html ) > $at + 8 &&
1909 '[' === $html[ $at + 2 ] &&
1910 'C' === $html[ $at + 3 ] &&
1911 'D' === $html[ $at + 4 ] &&
1912 'A' === $html[ $at + 5 ] &&
1913 'T' === $html[ $at + 6 ] &&
1914 'A' === $html[ $at + 7 ] &&
1915 '[' === $html[ $at + 8 ]
1916 ) {
1917 $closer_at = strpos( $html, ']]>', $at + 9 );
1918 if ( false === $closer_at ) {
1919 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1920
1921 return false;
1922 }
1923
1924 $this->parser_state = self::STATE_CDATA_NODE;
1925 $this->text_starts_at = $at + 9;
1926 $this->text_length = $closer_at - $this->text_starts_at;
1927 $this->token_length = $closer_at + 3 - $this->token_starts_at;
1928 $this->bytes_already_parsed = $closer_at + 3;
1929 return true;
1930 }
1931
1932 /*
1933 * Anything else here is an incorrectly-opened comment and transitions
1934 * to the bogus comment state - skip to the nearest >. If no closer is
1935 * found then the HTML was truncated inside the markup declaration.
1936 */
1937 $closer_at = strpos( $html, '>', $at + 1 );
1938 if ( false === $closer_at ) {
1939 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1940
1941 return false;
1942 }
1943
1944 $this->parser_state = self::STATE_COMMENT;
1945 $this->comment_type = self::COMMENT_AS_INVALID_HTML;
1946 $this->token_length = $closer_at + 1 - $this->token_starts_at;
1947 $this->text_starts_at = $this->token_starts_at + 2;
1948 $this->text_length = $closer_at - $this->text_starts_at;
1949 $this->bytes_already_parsed = $closer_at + 1;
1950
1951 /*
1952 * Identify nodes that would be CDATA if HTML had CDATA sections.
1953 *
1954 * This section must occur after identifying the bogus comment end
1955 * because in an HTML parser it will span to the nearest `>`, even
1956 * if there's no `]]>` as would be required in an XML document. It
1957 * is therefore not possible to parse a CDATA section containing
1958 * a `>` in the HTML syntax.
1959 *
1960 * Inside foreign elements there is a discrepancy between browsers
1961 * and the specification on this.
1962 *
1963 * @todo Track whether the Tag Processor is inside a foreign element
1964 * and require the proper closing `]]>` in those cases.
1965 */
1966 if (
1967 $this->token_length >= 10 &&
1968 '[' === $html[ $this->token_starts_at + 2 ] &&
1969 'C' === $html[ $this->token_starts_at + 3 ] &&
1970 'D' === $html[ $this->token_starts_at + 4 ] &&
1971 'A' === $html[ $this->token_starts_at + 5 ] &&
1972 'T' === $html[ $this->token_starts_at + 6 ] &&
1973 'A' === $html[ $this->token_starts_at + 7 ] &&
1974 '[' === $html[ $this->token_starts_at + 8 ] &&
1975 ']' === $html[ $closer_at - 1 ] &&
1976 ']' === $html[ $closer_at - 2 ]
1977 ) {
1978 $this->parser_state = self::STATE_COMMENT;
1979 $this->comment_type = self::COMMENT_AS_CDATA_LOOKALIKE;
1980 $this->text_starts_at += 7;
1981 $this->text_length -= 9;
1982 }
1983
1984 return true;
1985 }
1986
1987 /*
1988 * </> is a missing end tag name, which is ignored.
1989 *
1990 * This was also known as the "presumptuous empty tag"
1991 * in early discussions as it was proposed to close
1992 * the nearest previous opening tag.
1993 *
1994 * See https://html.spec.whatwg.org/#parse-error-missing-end-tag-name
1995 */
1996 if ( '>' === $html[ $at + 1 ] ) {
1997 // `<>` is interpreted as plaintext.
1998 if ( ! $this->is_closing_tag ) {
1999 ++$at;
2000 continue;
2001 }
2002
2003 $this->parser_state = self::STATE_PRESUMPTUOUS_TAG;
2004 $this->token_length = $at + 2 - $this->token_starts_at;
2005 $this->bytes_already_parsed = $at + 2;
2006 return true;
2007 }
2008
2009 /*
2010 * `<?` transitions to a bogus comment state – skip to the nearest >
2011 * See https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
2012 */
2013 if ( ! $this->is_closing_tag && '?' === $html[ $at + 1 ] ) {
2014 $closer_at = strpos( $html, '>', $at + 2 );
2015 if ( false === $closer_at ) {
2016 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
2017
2018 return false;
2019 }
2020
2021 $this->parser_state = self::STATE_COMMENT;
2022 $this->comment_type = self::COMMENT_AS_INVALID_HTML;
2023 $this->token_length = $closer_at + 1 - $this->token_starts_at;
2024 $this->text_starts_at = $this->token_starts_at + 2;
2025 $this->text_length = $closer_at - $this->text_starts_at;
2026 $this->bytes_already_parsed = $closer_at + 1;
2027
2028 /*
2029 * Identify a Processing Instruction node were HTML to have them.
2030 *
2031 * This section must occur after identifying the bogus comment end
2032 * because in an HTML parser it will span to the nearest `>`, even
2033 * if there's no `?>` as would be required in an XML document. It
2034 * is therefore not possible to parse a Processing Instruction node
2035 * containing a `>` in the HTML syntax.
2036 *
2037 * XML allows for more target names, but this code only identifies
2038 * those with ASCII-representable target names. This means that it
2039 * may identify some Processing Instruction nodes as bogus comments,
2040 * but it will not misinterpret the HTML structure. By limiting the
2041 * identification to these target names the Tag Processor can avoid
2042 * the need to start parsing UTF-8 sequences.
2043 *
2044 * > NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] |
2045 * [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] |
2046 * [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
2047 * [#x10000-#xEFFFF]
2048 * > NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
2049 *
2050 * @todo Processing instruction nodes in SGML may contain any kind of markup. XML defines a
2051 * special case with `<?xml ... ?>` syntax, but the `?` is part of the bogus comment.
2052 *
2053 * @see https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PITarget
2054 */
2055 if ( $this->token_length >= 5 && '?' === $html[ $closer_at - 1 ] ) {
2056 $comment_text = substr( $html, $this->token_starts_at + 2, $this->token_length - 4 );
2057 $pi_target_length = strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_' );
2058
2059 if ( 0 < $pi_target_length ) {
2060 $pi_target_length += strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_-.', $pi_target_length );
2061
2062 $this->comment_type = self::COMMENT_AS_PI_NODE_LOOKALIKE;
2063 $this->tag_name_starts_at = $this->token_starts_at + 2;
2064 $this->tag_name_length = $pi_target_length;
2065 $this->text_starts_at += $pi_target_length;
2066 $this->text_length -= $pi_target_length + 1;
2067 }
2068 }
2069
2070 return true;
2071 }
2072
2073 /*
2074 * If a non-alpha starts the tag name in a tag closer it's a comment.
2075 * Find the first `>`, which closes the comment.
2076 *
2077 * This parser classifies these particular comments as special "funky comments"
2078 * which are made available for further processing.
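 *
 * For illustration, a sketch through the public API (`</%url>` is an
 * arbitrary example input):
 *
 *     $p = new WP_HTML_Tag_Processor( '</%url>' );
 *     $p->next_token();
 *     '#funky-comment' === $p->get_token_type();
 *     '%url' === $p->get_modifiable_text();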
2079 *
2080 * See https://html.spec.whatwg.org/#parse-error-invalid-first-character-of-tag-name
2081 */
2082 if ( $this->is_closing_tag ) {
2083 // No chance of finding a closer.
2084 if ( $at + 3 > $doc_length ) {
2085 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
2086
2087 return false;
2088 }
2089
2090 $closer_at = strpos( $html, '>', $at + 2 );
2091 if ( false === $closer_at ) {
2092 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
2093
2094 return false;
2095 }
2096
2097 $this->parser_state = self::STATE_FUNKY_COMMENT;
2098 $this->token_length = $closer_at + 1 - $this->token_starts_at;
2099 $this->text_starts_at = $this->token_starts_at + 2;
2100 $this->text_length = $closer_at - $this->text_starts_at;
2101 $this->bytes_already_parsed = $closer_at + 1;
2102 return true;
2103 }
2104
2105 ++$at;
2106 }
2107
2108 /*
2109 * This does not imply an incomplete parse; it indicates that there
2110 * can be nothing left in the document other than a #text node.
2111 */
2112 $this->parser_state = self::STATE_TEXT_NODE;
2113 $this->token_starts_at = $was_at;
2114 $this->token_length = $doc_length - $was_at;
2115 $this->text_starts_at = $was_at;
2116 $this->text_length = $this->token_length;
2117 $this->bytes_already_parsed = $doc_length;
2118 return true;
2119 }
2120
2121 /**
2122 * Parses the next attribute.
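 *
 * For illustration, the effect on duplicate attributes as seen through
 * the public API (a sketch; the markup is made up): only the first
 * declaration is used, and names compare ASCII case-insensitively.
 *
 *     $p = new WP_HTML_Tag_Processor( '<div id="first" ID="second">' );
 *     $p->next_tag();
 *     'first' === $p->get_attribute( 'id' );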
2123 *
2124 * @since 6.2.0
2125 *
2126 * @return bool Whether an attribute was found before the end of the document.
2127 */
2128 private function parse_next_attribute(): bool {
2129 $doc_length = strlen( $this->html );
2130
2131 // Skip whitespace and slashes.
2132 $this->bytes_already_parsed += strspn( $this->html, " \t\f\r\n/", $this->bytes_already_parsed );
2133 if ( $this->bytes_already_parsed >= $doc_length ) {
2134 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
2135
2136 return false;
2137 }
2138
2139 /*
2140 * Treat the equal sign as a part of the attribute
2141 * name if it is the first encountered byte.
2142 *
2143 * @see https://html.spec.whatwg.org/multipage/parsing.html#before-attribute-name-state
2144 */
2145 $name_length = '=' === $this->html[ $this->bytes_already_parsed ]
2146 ? 1 + strcspn( $this->html, "=/> \t\f\r\n", $this->bytes_already_parsed + 1 )
2147 : strcspn( $this->html, "=/> \t\f\r\n", $this->bytes_already_parsed );
2148
2149 // No attribute, just tag closer.
2150 if ( 0 === $name_length || $this->bytes_already_parsed + $name_length >= $doc_length ) {
2151 return false;
2152 }
2153
2154 $attribute_start = $this->bytes_already_parsed;
2155 $attribute_name = substr( $this->html, $attribute_start, $name_length );
2156 $this->bytes_already_parsed += $name_length;
2157 if ( $this->bytes_already_parsed >= $doc_length ) {
2158 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
2159
2160 return false;
2161 }
2162
2163 $this->skip_whitespace();
2164 if ( $this->bytes_already_parsed >= $doc_length ) {
2165 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
2166
2167 return false;
2168 }
2169
2170 $has_value = '=' === $this->html[ $this->bytes_already_parsed ];
2171 if ( $has_value ) {
2172 ++$this->bytes_already_parsed;
2173 $this->skip_whitespace();
2174 if ( $this->bytes_already_parsed >= $doc_length ) {
2175 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
2176
2177 return false;
2178 }
2179
2180 switch ( $this->html[ $this->bytes_already_parsed ] ) {
2181 case "'":
2182 case '"':
2183 $quote = $this->html[ $this->bytes_already_parsed ];
2184 $value_start = $this->bytes_already_parsed + 1;
2185 $end_quote_at = strpos( $this->html, $quote, $value_start );
2186 $end_quote_at = false === $end_quote_at ? $doc_length : $end_quote_at;
2187 $value_length = $end_quote_at - $value_start;
2188 $attribute_end = $end_quote_at + 1;
2189 $this->bytes_already_parsed = $attribute_end;
2190 break;
2191
2192 default:
2193 $value_start = $this->bytes_already_parsed;
2194 $value_length = strcspn( $this->html, "> \t\f\r\n", $value_start );
2195 $attribute_end = $value_start + $value_length;
2196 $this->bytes_already_parsed = $attribute_end;
2197 }
2198 } else {
2199 $value_start = $this->bytes_already_parsed;
2200 $value_length = 0;
2201 $attribute_end = $attribute_start + $name_length;
2202 }
2203
2204 if ( $attribute_end >= $doc_length ) {
2205 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
2206
2207 return false;
2208 }
2209
2210 if ( $this->is_closing_tag ) {
2211 return true;
2212 }
2213
2214 /*
2215 * > There must never be two or more attributes on
2216 * > the same start tag whose names are an ASCII
2217 * > case-insensitive match for each other.
2218 * - HTML 5 spec
2219 *
2220 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
2221 */
2222 $comparable_name = strtolower( $attribute_name );
2223
2224 // If an attribute is listed many times, only use the first declaration and ignore the rest.
2225 if ( ! isset( $this->attributes[ $comparable_name ] ) ) {
2226 $this->attributes[ $comparable_name ] = new WP_HTML_Attribute_Token(
2227 $attribute_name,
2228 $value_start,
2229 $value_length,
2230 $attribute_start,
2231 $attribute_end - $attribute_start,
2232 ! $has_value
2233 );
2234
2235 return true;
2236 }
2237
2238 /*
2239 * Track the duplicate attributes so that if one is removed, all disappear together.
2240 *
2241 * While `$this->duplicate_attributes` could always be stored as an `array()`,
2242 * which would simplify the logic here, storing a `null` and only allocating
2243 * an array when encountering duplicates avoids needless allocations in the
2244 * normative case of parsing tags with no duplicate attributes.
2245 */
2246 $duplicate_span = new WP_HTML_Span( $attribute_start, $attribute_end - $attribute_start );
2247 if ( null === $this->duplicate_attributes ) {
2248 $this->duplicate_attributes = array( $comparable_name => array( $duplicate_span ) );
2249 } elseif ( ! isset( $this->duplicate_attributes[ $comparable_name ] ) ) {
2250 $this->duplicate_attributes[ $comparable_name ] = array( $duplicate_span );
2251 } else {
2252 $this->duplicate_attributes[ $comparable_name ][] = $duplicate_span;
2253 }
2254
2255 return true;
2256 }
2257
2258 /**
2259 * Moves the internal cursor past any immediately following whitespace.
2260 *
2261 * @since 6.2.0
2262 */
2263 private function skip_whitespace(): void {
2264 $this->bytes_already_parsed += strspn( $this->html, " \t\f\r\n", $this->bytes_already_parsed );
2265 }
2266
2267 /**
2268 * Applies attribute updates and cleans up once a tag is fully parsed.
2269 *
2270 * @since 6.2.0
2271 */
2272 private function after_tag(): void {
2273 /*
2274 * There could be lexical updates enqueued for an attribute that
2275 * also exists on the next tag. In order to avoid conflating the
2276 * attributes across the two tags, lexical updates with names
2277 * need to be flushed to raw lexical updates.
2278 */
2279 $this->class_name_updates_to_attributes_updates();
2280
2281 /*
2282 * Purge updates if there are too many. The actual count isn't
2283 * scientific, but a few values from 100 to a few thousand were
2284 * tested to find a practically-useful limit.
2285 *
2286 * If the update queue grows too big, then the Tag Processor
2287 * will spend more time iterating through them and lose the
2288 * efficiency gains of deferring applying them.
2289 */
2290 if ( 1000 < count( $this->lexical_updates ) ) {
2291 $this->get_updated_html();
2292 }
2293
2294 foreach ( $this->lexical_updates as $name => $update ) {
2295 /*
2296 * Any updates appearing after the cursor should be applied
2297 * before proceeding, otherwise they may be overlooked.
2298 */
2299 if ( $update->start >= $this->bytes_already_parsed ) {
2300 $this->get_updated_html();
2301 break;
2302 }
2303
2304 if ( is_int( $name ) ) {
2305 continue;
2306 }
2307
2308 $this->lexical_updates[] = $update;
2309 unset( $this->lexical_updates[ $name ] );
2310 }
2311
2312 $this->token_starts_at = null;
2313 $this->token_length = null;
2314 $this->tag_name_starts_at = null;
2315 $this->tag_name_length = null;
2316 $this->text_starts_at = 0;
2317 $this->text_length = 0;
2318 $this->is_closing_tag = null;
2319 $this->attributes = array();
2320 $this->comment_type = null;
2321 $this->text_node_classification = self::TEXT_IS_GENERIC;
2322 $this->duplicate_attributes = null;
2323 }
2324
2325 /**
2326 * Converts class name updates into tag attributes updates
2327 * (they are accumulated in different data formats for performance).
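 *
 * For illustration, the combined effect as seen through the public API
 * (a sketch; the markup and class names are made up). Removed classes
 * are dropped, surviving ones keep their original separating whitespace,
 * and added classes are appended after a single space.
 *
 *     $p = new WP_HTML_Tag_Processor( '<div class="a b c"></div>' );
 *     $p->next_tag();
 *     $p->remove_class( 'b' );
 *     $p->add_class( 'd' );
 *     '<div class="a c d"></div>' === $p->get_updated_html();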
2328 *
2329 * @since 6.2.0
2330 *
2331 * @see WP_HTML_Tag_Processor::$lexical_updates
2332 * @see WP_HTML_Tag_Processor::$classname_updates
2333 */
2334 private function class_name_updates_to_attributes_updates(): void {
2335 if ( count( $this->classname_updates ) === 0 ) {
2336 return;
2337 }
2338
2339 $existing_class = $this->get_enqueued_attribute_value( 'class' );
2340 if ( null === $existing_class || true === $existing_class ) {
2341 $existing_class = '';
2342 }
2343
2344 if ( false === $existing_class && isset( $this->attributes['class'] ) ) {
2345 $existing_class = WP_HTML_Decoder::decode_attribute(
2346 substr(
2347 $this->html,
2348 $this->attributes['class']->value_starts_at,
2349 $this->attributes['class']->value_length
2350 )
2351 );
2352 }
2353
2354 if ( false === $existing_class ) {
2355 $existing_class = '';
2356 }
2357
2358 /**
2359 * Updated "class" attribute value.
2360 *
2361 * This is incrementally built while scanning through the existing class
2362 * attribute, skipping removed classes on the way, and then appending
2363 * added classes at the end. Only when finished processing will the
2364 * value contain the final new value.
2365 *
2366 * @var string $class
2367 */
2368 $class = '';
2369
2370 /**
2371 * Tracks the cursor position in the existing
2372 * class attribute value while parsing.
2373 *
2374 * @var int $at
2375 */
2376 $at = 0;
2377
2378 /**
2379 * Indicates if there's any need to modify the existing class attribute.
2380 *
2381 * If a call to `add_class()` and `remove_class()` wouldn't impact
2382 * the `class` attribute value then there's no need to rebuild it.
2383 * For example, when adding a class that's already present or
2384 * removing one that isn't.
2385 *
2386 * This flag enables a performance optimization when none of the enqueued
2387 * class updates would impact the `class` attribute; namely, that the
2388 * processor can continue without modifying the input document, as if
2389 * none of the `add_class()` or `remove_class()` calls had been made.
2390 *
2391 * This flag is set upon the first change that requires a string update.
2392 *
2393 * @var bool $modified
2394 */
2395 $modified = false;
2396
2397 $seen = array();
2398 $to_remove = array();
2399 $is_quirks = self::QUIRKS_MODE === $this->compat_mode;
2400 if ( $is_quirks ) {
2401 foreach ( $this->classname_updates as $updated_name => $action ) {
2402 if ( self::REMOVE_CLASS === $action ) {
2403 $to_remove[] = strtolower( $updated_name );
2404 }
2405 }
2406 } else {
2407 foreach ( $this->classname_updates as $updated_name => $action ) {
2408 if ( self::REMOVE_CLASS === $action ) {
2409 $to_remove[] = $updated_name;
2410 }
2411 }
2412 }
2413
2414 // Remove unwanted classes by only copying the new ones.
2415 $existing_class_length = strlen( $existing_class );
2416 while ( $at < $existing_class_length ) {
2417 // Skip to the first non-whitespace character.
2418 $ws_at = $at;
2419 $ws_length = strspn( $existing_class, " \t\f\r\n", $ws_at );
2420 $at += $ws_length;
2421
2422 // Capture the class name – it's everything until the next whitespace.
2423 $name_length = strcspn( $existing_class, " \t\f\r\n", $at );
2424 if ( 0 === $name_length ) {
2425 // If no more class names are found then that's the end.
2426 break;
2427 }
2428
2429 $name = substr( $existing_class, $at, $name_length );
2430 $comparable_class_name = $is_quirks ? strtolower( $name ) : $name;
2431 $at += $name_length;
2432
2433 // If this class is marked for removal, remove it and move on to the next one.
2434 if ( in_array( $comparable_class_name, $to_remove, true ) ) {
2435 $modified = true;
2436 continue;
2437 }
2438
2439 // If a class has already been seen then skip it; it should not be added twice.
2440 if ( in_array( $comparable_class_name, $seen, true ) ) {
2441 continue;
2442 }
2443
2444 $seen[] = $comparable_class_name;
2445
2446 /*
2447 * Otherwise, append it to the new "class" attribute value.
2448 *
2449 * There are options for handling whitespace between class names.
2450 * Preserving the existing whitespace produces fewer changes
2451 * to the HTML content and should clarify the before/after
2452 * content when debugging the modified output.
2453 *
2454 * This approach contrasts with normalizing the inter-class
2455 * whitespace to a single space, which might appear cleaner
2456 * in the output HTML but produces a noisier change.
2457 */
2458 if ( '' !== $class ) {
2459 $class .= substr( $existing_class, $ws_at, $ws_length );
2460 }
2461 $class .= $name;
2462 }
2463
2464 // Add new classes by appending those which haven't already been seen.
2465 foreach ( $this->classname_updates as $name => $operation ) {
2466 $comparable_name = $is_quirks ? strtolower( $name ) : $name;
2467 if ( self::ADD_CLASS === $operation && ! in_array( $comparable_name, $seen, true ) ) {
2468 $modified = true;
2469
2470 $class .= strlen( $class ) > 0 ? ' ' : '';
2471 $class .= $name;
2472 }
2473 }
2474
2475 $this->classname_updates = array();
2476 if ( ! $modified ) {
2477 return;
2478 }
2479
2480 if ( strlen( $class ) > 0 ) {
2481 $this->set_attribute( 'class', $class );
2482 } else {
2483 $this->remove_attribute( 'class' );
2484 }
2485 }
2486
2487 /**
2488 * Applies attribute updates to HTML document.
2489 *
2490 * @since 6.2.0
2491 * @since 6.2.1 Accumulates shift for internal cursor and passed pointer.
2492 * @since 6.3.0 Invalidate any bookmarks whose targets are overwritten.
2493 *
2494 * @param int $shift_this_point Accumulate and return shift for this position.
2495 * @return int How many bytes the given pointer moved in response to the updates.
2496 */
2497 private function apply_attributes_updates( int $shift_this_point ): int {
2498 if ( ! count( $this->lexical_updates ) ) {
2499 return 0;
2500 }
2501
2502 $accumulated_shift_for_given_point = 0;
2503
2504 /*
2505 * Attribute updates can be enqueued in any order but updates
2506 * to the document must occur in lexical order; that is, each
2507 * replacement must be made before all others which follow it
2508 * at later string indices in the input document.
2509 *
2510 * Sorting avoids making out-of-order replacements which
2511 * can lead to mangled output, partially-duplicated
2512 * attributes, and overwritten attributes.
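 *
 * For illustration, a sketch through the public API (the markup is
 * made up): updates enqueued in reverse document order are still
 * applied front to back.
 *
 *     $p = new WP_HTML_Tag_Processor( '<img src="a.png" alt="x">' );
 *     $p->next_tag();
 *     $p->set_attribute( 'alt', 'y' );
 *     $p->set_attribute( 'src', 'b.png' );
 *     '<img src="b.png" alt="y">' === $p->get_updated_html();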
2513 */
2514 usort( $this->lexical_updates, array( self::class, 'sort_start_ascending' ) );
2515
2516 $bytes_already_copied = 0;
2517 $output_buffer = '';
2518 foreach ( $this->lexical_updates as $diff ) {
2519 $shift = strlen( $diff->text ) - $diff->length;
2520
2521 // Adjust the cursor position by however much an update affects it.
2522 if ( $diff->start < $this->bytes_already_parsed ) {
2523 $this->bytes_already_parsed += $shift;
2524 }
2525
2526 // Accumulate shift of the given pointer within this function call.
2527 if ( $diff->start < $shift_this_point ) {
2528 $accumulated_shift_for_given_point += $shift;
2529 }
2530
2531 $output_buffer .= substr( $this->html, $bytes_already_copied, $diff->start - $bytes_already_copied );
2532 $output_buffer .= $diff->text;
2533 $bytes_already_copied = $diff->start + $diff->length;
2534 }
2535
2536 $this->html = $output_buffer . substr( $this->html, $bytes_already_copied );
2537
2538 /*
2539 * Adjust bookmark locations to account for how the text
2540 * replacements adjust offsets in the input document.
2541 */
2542 foreach ( $this->bookmarks as $bookmark_name => $bookmark ) {
2543 $bookmark_end = $bookmark->start + $bookmark->length;
2544
2545 /*
2546 * Each lexical update which appears before the bookmark's endpoints
2547 * might shift the offsets for those endpoints. Loop through each change
2548 * and accumulate the total shift for each bookmark, then apply that
2549 * shift after tallying the full delta.
2550 */
2551 $head_delta = 0;
2552 $tail_delta = 0;
2553
2554 foreach ( $this->lexical_updates as $diff ) {
2555 $diff_end = $diff->start + $diff->length;
2556
2557 if ( $bookmark->start < $diff->start && $bookmark_end < $diff->start ) {
2558 break;
2559 }
2560
2561 if ( $bookmark->start >= $diff->start && $bookmark_end < $diff_end ) {
2562 $this->release_bookmark( $bookmark_name );
2563 continue 2;
2564 }
2565
2566 $delta = strlen( $diff->text ) - $diff->length;
2567
2568 if ( $bookmark->start >= $diff->start ) {
2569 $head_delta += $delta;
2570 }
2571
2572 if ( $bookmark_end >= $diff_end ) {
2573 $tail_delta += $delta;
2574 }
2575 }
2576
2577 $bookmark->start += $head_delta;
2578 $bookmark->length += $tail_delta - $head_delta;
2579 }
2580
2581 $this->lexical_updates = array();
2582
2583 return $accumulated_shift_for_given_point;
2584 }
2585
2586 /**
2587 * Checks whether a bookmark with the given name exists.
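 *
 * Example (a sketch; the bookmark name `start` is hypothetical):
 *
 *     $p = new WP_HTML_Tag_Processor( '<div>Test</div>' );
 *     false === $p->has_bookmark( 'start' );
 *     $p->next_tag();
 *     $p->set_bookmark( 'start' );
 *     true === $p->has_bookmark( 'start' );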
2588 *
2589 * @since 6.3.0
2590 *
2591 * @param string $bookmark_name Name to identify a bookmark that potentially exists.
2592 * @return bool Whether that bookmark exists.
2593 */
2594 public function has_bookmark( $bookmark_name ): bool {
2595 return array_key_exists( $bookmark_name, $this->bookmarks );
2596 }
2597
2598 /**
2599 * Move the internal cursor in the Tag Processor to a given bookmark's location.
2600 *
2601 * In order to prevent accidental infinite loops, there's a
2602 * maximum limit on the number of times seek() can be called.
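 *
 * Example (a sketch; the bookmark name, class name, and markup are
 * assumptions chosen for demonstration):
 *
 *     $p = new WP_HTML_Tag_Processor( '<ul><li>One</li><li>Two</li></ul>' );
 *     $p->next_tag( 'li' );
 *     $p->set_bookmark( 'first-item' );
 *     $p->next_tag( 'li' );
 *
 *     // Jump back to the first LI and modify it.
 *     if ( $p->seek( 'first-item' ) ) {
 *         $p->add_class( 'revisited' );
 *     }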
2603 *
2604 * @since 6.2.0
2605 *
2606 * @param string $bookmark_name Jump to the place in the document identified by this bookmark name.
2607 * @return bool Whether the internal cursor was successfully moved to the bookmark's location.
2608 */
2609 public function seek( $bookmark_name ): bool {
2610 if ( ! array_key_exists( $bookmark_name, $this->bookmarks ) ) {
2611 _doing_it_wrong(
2612 __METHOD__,
2613 __( 'Unknown bookmark name.' ),
2614 '6.2.0'
2615 );
2616 return false;
2617 }
2618
2619 $existing_bookmark = $this->bookmarks[ $bookmark_name ];
2620
2621 if (
2622 $this->token_starts_at === $existing_bookmark->start &&
2623 $this->token_length === $existing_bookmark->length
2624 ) {
2625 return true;
2626 }
2627
2628 if ( ++$this->seek_count > static::MAX_SEEK_OPS ) {
2629 _doing_it_wrong(
2630 __METHOD__,
2631 __( 'Too many calls to seek() - this can lead to performance issues.' ),
2632 '6.2.0'
2633 );
2634 return false;
2635 }
2636
2637 // Flush out any pending updates to the document.
2638 $this->get_updated_html();
2639
2640 // Point this tag processor before the sought tag opener and consume it.
2641 $this->bytes_already_parsed = $this->bookmarks[ $bookmark_name ]->start;
2642 $this->parser_state = self::STATE_READY;
2643 return $this->next_token();
2644 }
2645
2646 /**
2647 * Compare two WP_HTML_Text_Replacement objects.
2648 *
2649 * @since 6.2.0
2650 *
2651 * @param WP_HTML_Text_Replacement $a First attribute update.
2652 * @param WP_HTML_Text_Replacement $b Second attribute update.
2653 * @return int Comparison value for string order.
2654 */
2655 private static function sort_start_ascending( WP_HTML_Text_Replacement $a, WP_HTML_Text_Replacement $b ): int {
2656 $by_start = $a->start - $b->start;
2657 if ( 0 !== $by_start ) {
2658 return $by_start;
2659 }
2660
2661 $by_text = isset( $a->text, $b->text ) ? strcmp( $a->text, $b->text ) : 0;
2662 if ( 0 !== $by_text ) {
2663 return $by_text;
2664 }
2665
2666 /*
2667 * This code should be unreachable, because it implies the two replacements
2668 * start at the same location and contain the same text.
2669 */
2670 return $a->length - $b->length;
2671 }
2672
2673 /**
2674 * Return the enqueued value for a given attribute, if one exists.
2675 *
2676 * Enqueued updates can take different data types:
2677 * - If an update is enqueued and is boolean, the return will be `true`.
2678 * - If an update is otherwise enqueued, the return will be the string value of that update.
2679 * - If an attribute is enqueued to be removed, the return will be `null` to indicate that.
2680 * - If no updates are enqueued, the return will be `false` to differentiate from "removed."
2681 *
2682 * @since 6.2.0
2683 *
2684 * @param string $comparable_name The attribute name in its comparable form.
2685 * @return string|boolean|null Value of enqueued update if present, otherwise false.
2686 */
2687 private function get_enqueued_attribute_value( string $comparable_name ) {
2688 if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
2689 return false;
2690 }
2691
2692 if ( ! isset( $this->lexical_updates[ $comparable_name ] ) ) {
2693 return false;
2694 }
2695
2696 $enqueued_text = $this->lexical_updates[ $comparable_name ]->text;
2697
2698 // Removed attributes erase the entire span.
2699 if ( '' === $enqueued_text ) {
2700 return null;
2701 }
2702
2703 /*
2704 * Boolean attribute updates are just the attribute name without a corresponding value.
2705 *
2706 * This value might differ from the given comparable name in that there could be leading
2707 * or trailing whitespace, and that the casing follows the name given in `set_attribute`.
2708 *
2709 * Example:
2710 *
2711 * $p->set_attribute( 'data-TEST-id', 'update' );
2712 * 'update' === $p->get_enqueued_attribute_value( 'data-test-id' );
2713 *
2714 * Detect this difference based on the absence of the `=`, which _must_ exist in any
2715 * attribute containing a value, e.g. `<input type="text" enabled />`.
2716 *                                            ¹           ²
2717 * 1. Attribute with a string value.
2718 * 2. Boolean attribute whose value is `true`.
2719 */
2720 $equals_at = strpos( $enqueued_text, '=' );
2721 if ( false === $equals_at ) {
2722 return true;
2723 }
2724
2725 /*
2726 * Finally, a normal update's value will appear after the `=` and
2727 * be double-quoted, as performed incidentally by `set_attribute`.
2728 *
2729 * e.g. `type="text"`
2730 *           ¹²    ³
2731 * 1. Equals is here.
2732 * 2. Double-quoting starts one after the equals sign.
2733 * 3. Double-quoting ends at the last character in the update.
2734 */
2735 $enqueued_value = substr( $enqueued_text, $equals_at + 2, -1 );
2736 return WP_HTML_Decoder::decode_attribute( $enqueued_value );
2737 }
2738
2739 /**
2740 * Returns the value of a requested attribute from a matched tag opener if that attribute exists.
2741 *
2742 * Example:
2743 *
2744 * $p = new WP_HTML_Tag_Processor( '<div enabled class="test" data-test-id="14">Test</div>' );
2745 * $p->next_tag( array( 'class_name' => 'test' ) ) === true;
2746 * $p->get_attribute( 'data-test-id' ) === '14';
2747 * $p->get_attribute( 'enabled' ) === true;
2748 * $p->get_attribute( 'aria-label' ) === null;
2749 *
2750 * $p->next_tag() === false;
2751 * $p->get_attribute( 'class' ) === null;
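 *
 * Enqueued updates take priority over the values found in the input document.
 * The following is an illustrative sketch of that behavior, not a definitive
 * reference; it assumes the `set_attribute()` and `remove_attribute()` methods
 * defined elsewhere in this class.
 *
 *     $p = new WP_HTML_Tag_Processor( '<div class="test">Test</div>' );
 *     $p->next_tag() === true;
 *
 *     $p->set_attribute( 'id', 'main' );   // Assumed to enqueue `id="main"`.
 *     $p->get_attribute( 'id' ) === 'main';
 *
 *     $p->set_attribute( 'hidden', true ); // Assumed to enqueue the bare name `hidden`.
 *     $p->get_attribute( 'hidden' ) === true;
 *
 *     $p->remove_attribute( 'class' );     // Assumed to enqueue an empty replacement.
 *     $p->get_attribute( 'class' ) === null;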
2752 *
2753 * @since 6.2.0
2754 *
2755 * @param string $name Name of attribute whose value is requested.
2756 * @return string|true|null Value of attribute or `null` if not available. Boolean attributes return `true`.
2757 */
2758 public function get_attribute( $name ) {
2759 if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
2760 return null;
2761 }
2762
2763 $comparable = strtolower( $name );
2764
2765 /*
2766 * For every attribute other than `class` it's possible to perform a quick check if
2767 * there's an enqueued lexical update whose value takes priority over what's found in
2768 * the input document.
2769 *
2770 * The `class` attribute is special though because of the exposed helpers `add_class`
2771 * and `remove_class`. These form a builder for the `class` attribute, so an additional
2772 * check for enqueued class changes is required in addition to the check for any enqueued
2773 * attribute values. If any exist, those enqueued class changes must first be flushed out
2774 * into an attribute value update.
2775 */
2776 if ( 'class' === $comparable ) {
2777 $this->class_name_updates_to_attributes_updates();
2778 }
2779
2780 // Return any enqueued attribute value updates if they exist.
2781 $enqueued_value = $this->get_enqueued_attribute_value( $comparable );
2782 if ( false !== $enqueued_value ) {
2783 return $enqueued_value;
2784 }
2785
2786 if ( ! isset( $this->attributes[ $comparable ] ) ) {
2787 return null;
2788 }
2789
2790 $attribute = $this->attributes[ $comparable ];
2791
2792 /*
2793 * This flag distinguishes an attribute with no value
2794 * from an attribute with an empty string value. For
2795 * unquoted attributes this could look very similar.
2796 * It refers to whether an `=` follows the name.
2797 *
2798 * e.g. <div boolean-attribute empty-attribute=></div>
2799 *           ¹                 ²
2800 * 1. Attribute `boolean-attribute` is `true`.
2801 * 2. Attribute `empty-attribute` is `""`.
2802 */
2803 if ( true === $attribute->is_true ) {
2804 return true;
2805 }
2806
2807 $raw_value = substr( $this->html, $attribute->value_starts_at, $attribute->value_length );
2808
2809 return WP_HTML_Decoder::decode_attribute( $raw_value );
2810 }
2811
2812 /**
2813 * Gets lowercase names of all attributes matching a given prefix in the current tag.
2814 *
2815 * Note that matching is case-insensitive. This is in accordance with the spec:
2816 *
2817 * > There must never be two or more attributes on
2818 * > the same start tag whose names are an ASCII
2819 * > case-insensitive match for each other.
2820 * - HTML 5 spec
2821 *
2822 * Example:
2823 *
2824 * $p = new WP_HTML_Tag_Processor( '<div data-ENABLED class="test" DATA-test-id="14">Test</div>' );
2825 * $p->next_tag( array( 'class_name' => 'test' ) ) === true;
2826 * $p->get_attribute_names_with_prefix( 'data-' ) === array( 'data-enabled', 'data-test-id' );
2827 *
2828 * $p->next_tag() === false;
2829 * $p->get_attribute_names_with_prefix( 'data-' ) === null;
2830 *
2831 * @since 6.2.0
2832 *
2833 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
2834 *
2835 * @param string $prefix Prefix of requested attribute names.
2836 * @return array|null List of attribute names, or `null` when no tag opener is matched.
2837 */
2838 public function get_attribute_names_with_prefix( $prefix ): ?array {
2839 if (
2840 self::STATE_MATCHED_TAG !== $this->parser_state ||
2841 $this->is_closing_tag
2842 ) {
2843 return null;
2844 }
2845
2846 $comparable = strtolower( $prefix );
2847
2848 $matches = array();
2849 foreach ( array_keys( $this->attributes ) as $attr_name ) {
2850 if ( str_starts_with( $attr_name, $comparable ) ) {
2851 $matches[] = $attr_name;
2852 }
2853 }
2854 return $matches;
2855 }
2856
2857 /**
2858 * Returns the namespace of the matched token.
2859 *
2860 * @since 6.7.0
2861 *
2862 * @return string One of 'html', 'math', or 'svg'.
2863 */
2864 public function get_namespace(): string {
2865 return $this->parsing_namespace;
2866 }
2867
2868 /**
2869 * Returns the uppercase name of the matched tag.
2870 *
2871 * Example:
2872 *
2873 * $p = new WP_HTML_Tag_Processor( '<div class="test">Test</div>' );
2874 * $p->next_tag() === true;
2875 * $p->get_tag() === 'DIV';
2876 *
2877 * $p->next_tag() === false;
2878 * $p->get_tag() === null;
2879 *
2880 * @since 6.2.0
2881 *
2882 * @return string|null Name of currently matched tag in input HTML, or `null` if none found.
2883 */
2884 public function get_tag(): ?string {
2885 if ( null === $this->tag_name_starts_at ) {
2886 return null;
2887 }
2888
2889 $tag_name = substr( $this->html, $this->tag_name_starts_at, $this->tag_name_length );
2890
2891 if ( self::STATE_MATCHED_TAG === $this->parser_state ) {
2892 return strtoupper( $tag_name );
2893 }
2894
2895 if (
2896 self::STATE_COMMENT === $this->parser_state &&
2897 self::COMMENT_AS_PI_NODE_LOOKALIKE === $this->get_comment_type()
2898 ) {
2899 return $tag_name;
2900 }
2901
2902 return null;
2903 }
2904
2905 /**
2906 * Returns the adjusted tag name for a given token, taking into
2907 * account the current parsing context, whether HTML, SVG, or MathML.
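 *
 * Example (an illustrative sketch; it assumes the `change_parsing_namespace()`
 * method defined elsewhere in this class to select the SVG namespace):
 *
 *     $p = new WP_HTML_Tag_Processor( '<LINEARGRADIENT>' );
 *     $p->change_parsing_namespace( 'svg' ); // Assumed helper, see note above.
 *     $p->next_tag() === true;
 *     $p->get_tag() === 'LINEARGRADIENT';
 *     $p->get_qualified_tag_name() === 'linearGradient';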
2908 *
2909 * @since 6.7.0
2910 *
2911 * @return string|null Qualified name of the current tag, or `null` if no tag is matched.
2912 */
2913 public function get_qualified_tag_name(): ?string {
2914 $tag_name = $this->get_tag();
2915 if ( null === $tag_name ) {
2916 return null;
2917 }
2918
2919 if ( 'html' === $this->get_namespace() ) {
2920 return $tag_name;
2921 }
2922
2923 $lower_tag_name = strtolower( $tag_name );
2924 if ( 'math' === $this->get_namespace() ) {
2925 return $lower_tag_name;
2926 }
2927
2928 if ( 'svg' === $this->get_namespace() ) {
2929 switch ( $lower_tag_name ) {
2930 case 'altglyph':
2931 return 'altGlyph';
2932
2933 case 'altglyphdef':
2934 return 'altGlyphDef';
2935
2936 case 'altglyphitem':
2937 return 'altGlyphItem';
2938
2939 case 'animatecolor':
2940 return 'animateColor';
2941
2942 case 'animatemotion':
2943 return 'animateMotion';
2944
2945 case 'animatetransform':
2946 return 'animateTransform';
2947
2948 case 'clippath':
2949 return 'clipPath';
2950
2951 case 'feblend':
2952 return 'feBlend';
2953
2954 case 'fecolormatrix':
2955 return 'feColorMatrix';
2956
2957 case 'fecomponenttransfer':
2958 return 'feComponentTransfer';
2959
2960 case 'fecomposite':
2961 return 'feComposite';
2962
2963 case 'feconvolvematrix':
2964 return 'feConvolveMatrix';
2965
2966 case 'fediffuselighting':
2967 return 'feDiffuseLighting';
2968
2969 case 'fedisplacementmap':
2970 return 'feDisplacementMap';
2971
2972 case 'fedistantlight':
2973 return 'feDistantLight';
2974
2975 case 'fedropshadow':
2976 return 'feDropShadow';
2977
2978 case 'feflood':
2979 return 'feFlood';
2980
2981 case 'fefunca':
2982 return 'feFuncA';
2983
2984 case 'fefuncb':
2985 return 'feFuncB';
2986
2987 case 'fefuncg':
2988 return 'feFuncG';
2989
2990 case 'fefuncr':
2991 return 'feFuncR';
2992
2993 case 'fegaussianblur':
2994 return 'feGaussianBlur';
2995
2996 case 'feimage':
2997 return 'feImage';
2998
2999 case 'femerge':
3000 return 'feMerge';
3001
3002 case 'femergenode':
3003 return 'feMergeNode';
3004
3005 case 'femorphology':
3006 return 'feMorphology';
3007
3008 case 'feoffset':
3009 return 'feOffset';
3010
3011 case 'fepointlight':
3012 return 'fePointLight';
3013
3014 case 'fespecularlighting':
3015 return 'feSpecularLighting';
3016
3017 case 'fespotlight':
3018 return 'feSpotLight';
3019
3020 case 'fetile':
3021 return 'feTile';
3022
3023 case 'feturbulence':
3024 return 'feTurbulence';
3025
3026 case 'foreignobject':
3027 return 'foreignObject';
3028
3029 case 'glyphref':
3030 return 'glyphRef';
3031
3032 case 'lineargradient':
3033 return 'linearGradient';
3034
3035 case 'radialgradient':
3036 return 'radialGradient';
3037
3038 case 'textpath':
3039 return 'textPath';
3040
3041 default:
3042 return $lower_tag_name;
3043 }
3044 }
3045
3046 // This unnecessary return prevents tools from inaccurately reporting type errors.
3047 return $tag_name;
3048 }
3049
3050 /**
3051 * Returns the adjusted attribute name for a given attribute, taking into
3052 * account the current parsing context, whether HTML, SVG, or MathML.
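 *
 * Example (an illustrative sketch; it assumes the `change_parsing_namespace()`
 * method defined elsewhere in this class to select the SVG namespace):
 *
 *     $p = new WP_HTML_Tag_Processor( '<svg viewbox="0 0 10 10">' );
 *     $p->change_parsing_namespace( 'svg' ); // Assumed helper, see note above.
 *     $p->next_tag() === true;
 *     $p->get_qualified_attribute_name( 'VIEWBOX' ) === 'viewBox';
 *     $p->get_qualified_attribute_name( 'xlink:href' ) === 'xlink href';
 *     $p->get_qualified_attribute_name( 'class' ) === 'class';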
3053 *
3054 * @since 6.7.0
3055 *
3056 * @param string $attribute_name Which attribute to adjust.
3057 *
3058 * @return string|null Adjusted attribute name, or `null` when not matched on a tag.
3059 */
3060 public function get_qualified_attribute_name( $attribute_name ): ?string {
3061 if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
3062 return null;
3063 }
3064
3065 $namespace = $this->get_namespace();
3066 $lower_name = strtolower( $attribute_name );
3067
3068 if ( 'math' === $namespace && 'definitionurl' === $lower_name ) {
3069 return 'definitionURL';
3070 }
3071
3072 if ( 'svg' === $namespace ) {
3073 switch ( $lower_name ) {
3074 case 'attributename':
3075 return 'attributeName';
3076
3077 case 'attributetype':
3078 return 'attributeType';
3079
3080 case 'basefrequency':
3081 return 'baseFrequency';
3082
3083 case 'baseprofile':
3084 return 'baseProfile';
3085
3086 case 'calcmode':
3087 return 'calcMode';
3088
3089 case 'clippathunits':
3090 return 'clipPathUnits';
3091
3092 case 'diffuseconstant':
3093 return 'diffuseConstant';
3094
3095 case 'edgemode':
3096 return 'edgeMode';
3097
3098 case 'filterunits':
3099 return 'filterUnits';
3100
3101 case 'glyphref':
3102 return 'glyphRef';
3103
3104 case 'gradienttransform':
3105 return 'gradientTransform';
3106
3107 case 'gradientunits':
3108 return 'gradientUnits';
3109
3110 case 'kernelmatrix':
3111 return 'kernelMatrix';
3112
3113 case 'kernelunitlength':
3114 return 'kernelUnitLength';
3115
3116 case 'keypoints':
3117 return 'keyPoints';
3118
3119 case 'keysplines':
3120 return 'keySplines';
3121
3122 case 'keytimes':
3123 return 'keyTimes';
3124
3125 case 'lengthadjust':
3126 return 'lengthAdjust';
3127
3128 case 'limitingconeangle':
3129 return 'limitingConeAngle';
3130
3131 case 'markerheight':
3132 return 'markerHeight';
3133
3134 case 'markerunits':
3135 return 'markerUnits';
3136
3137 case 'markerwidth':
3138 return 'markerWidth';
3139
3140 case 'maskcontentunits':
3141 return 'maskContentUnits';
3142
3143 case 'maskunits':
3144 return 'maskUnits';
3145
3146 case 'numoctaves':
3147 return 'numOctaves';
3148
3149 case 'pathlength':
3150 return 'pathLength';
3151
3152 case 'patterncontentunits':
3153 return 'patternContentUnits';
3154
3155 case 'patterntransform':
3156 return 'patternTransform';
3157
3158 case 'patternunits':
3159 return 'patternUnits';
3160
3161 case 'pointsatx':
3162 return 'pointsAtX';
3163
3164 case 'pointsaty':
3165 return 'pointsAtY';
3166
3167 case 'pointsatz':
3168 return 'pointsAtZ';
3169
3170 case 'preservealpha':
3171 return 'preserveAlpha';
3172
3173 case 'preserveaspectratio':
3174 return 'preserveAspectRatio';
3175
3176 case 'primitiveunits':
3177 return 'primitiveUnits';
3178
3179 case 'refx':
3180 return 'refX';
3181
3182 case 'refy':
3183 return 'refY';
3184
3185 case 'repeatcount':
3186 return 'repeatCount';
3187
3188 case 'repeatdur':
3189 return 'repeatDur';
3190
3191 case 'requiredextensions':
3192 return 'requiredExtensions';
3193
3194 case 'requiredfeatures':
3195 return 'requiredFeatures';
3196
3197 case 'specularconstant':
3198 return 'specularConstant';
3199
3200 case 'specularexponent':
3201 return 'specularExponent';
3202
3203 case 'spreadmethod':
3204 return 'spreadMethod';
3205
3206 case 'startoffset':
3207 return 'startOffset';
3208
3209 case 'stddeviation':
3210 return 'stdDeviation';
3211
3212 case 'stitchtiles':
3213 return 'stitchTiles';
3214
3215 case 'surfacescale':
3216 return 'surfaceScale';
3217
3218 case 'systemlanguage':
3219 return 'systemLanguage';
3220
3221 case 'tablevalues':
3222 return 'tableValues';
3223
3224 case 'targetx':
3225 return 'targetX';
3226
3227 case 'targety':
3228 return 'targetY';
3229
3230 case 'textlength':
3231 return 'textLength';
3232
3233 case 'viewbox':
3234 return 'viewBox';
3235
3236 case 'viewtarget':
3237 return 'viewTarget';
3238
3239 case 'xchannelselector':
3240 return 'xChannelSelector';
3241
3242 case 'ychannelselector':
3243 return 'yChannelSelector';
3244
3245 case 'zoomandpan':
3246 return 'zoomAndPan';
3247 }
3248 }
3249
3250 if ( 'html' !== $namespace ) {
3251 switch ( $lower_name ) {
3252 case 'xlink:actuate':
3253 return 'xlink actuate';
3254
3255 case 'xlink:arcrole':
3256 return 'xlink arcrole';
3257
3258 case 'xlink:href':
3259 return 'xlink href';
3260
3261 case 'xlink:role':
3262 return 'xlink role';
3263
3264 case 'xlink:show':
3265 return 'xlink show';
3266
3267 case 'xlink:title':
3268 return 'xlink title';
3269
3270 case 'xlink:type':
3271 return 'xlink type';
3272
3273 case 'xml:lang':
3274 return 'xml lang';
3275
3276 case 'xml:space':
3277 return 'xml space';
3278
3279 case 'xmlns':
3280 return 'xmlns';
3281
3282 case 'xmlns:xlink':
3283 return 'xmlns xlink';
3284 }
3285 }
3286
3287 return $attribute_name;
3288 }
3289
3290 /**
3291 * Indicates if the currently matched tag contains the self-closing flag.
3292 *
3293 * No HTML elements ought to have the self-closing flag and for those, the self-closing
3294 * flag will be ignored. For void elements this is benign because they "self close"
3295 * automatically. For non-void HTML elements though problems will appear if someone
3296 * intends to use a self-closing element in place of that element with an empty body.
3297 * For HTML foreign elements and custom elements the self-closing flag determines if
3298 * they self-close or not.
3299 *
3300 * This function does not determine if a tag is self-closing,
3301 * but only if the self-closing flag is present in the syntax.
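 *
 * Example (an illustrative sketch derived from the syntax check in this method):
 *
 *     $p = new WP_HTML_Tag_Processor( '<br><img /><div></div>' );
 *     $p->next_tag() === true; // Matches the BR tag.
 *     $p->has_self_closing_flag() === false;
 *
 *     $p->next_tag() === true; // Matches the IMG tag.
 *     $p->has_self_closing_flag() === true;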
3302 *
3303 * @since 6.3.0
3304 *
3305 * @return bool Whether the currently matched tag contains the self-closing flag.
3306 */
3307 public function has_self_closing_flag(): bool {
3308 if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
3309 return false;
3310 }
3311
3312 /*
3313 * The self-closing flag is the solidus at the _end_ of the tag, not the beginning.
3314 *
3315 * Example:
3316 *
3317 * <figure />
3318 *         ^ this appears one character before the end of the closing ">".
3319 */
3320 return '/' === $this->html[ $this->token_starts_at + $this->token_length - 2 ];
3321 }
3322
3323 /**
3324 * Indicates if the current tag token is a tag closer.
3325 *
3326 * Example:
3327 *
3328 * $p = new WP_HTML_Tag_Processor( '<div></div>' );
3329 * $p->next_tag( array( 'tag_name' => 'div', 'tag_closers' => 'visit' ) );
3330 * $p->is_tag_closer() === false;
3331 *
3332 * $p->next_tag( array( 'tag_name' => 'div', 'tag_closers' => 'visit' ) );
3333 * $p->is_tag_closer() === true;
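 *
 * A BR closing tag is reported as an opener (an illustrative sketch of the
 * special case handled in this method):
 *
 *     $p = new WP_HTML_Tag_Processor( '</br>' );
 *     $p->next_tag( array( 'tag_closers' => 'visit' ) );
 *     $p->get_tag() === 'BR';
 *     $p->is_tag_closer() === false;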
3334 *
3335 * @since 6.2.0
3336 * @since 6.7.0 Reports all BR tags as opening tags.
3337 *
3338 * @return bool Whether the current tag is a tag closer.
3339 */
3340 public function is_tag_closer(): bool {
3341 return (
3342 self::STATE_MATCHED_TAG === $this->parser_state &&
3343 $this->is_closing_tag &&
3344
3345 /*
3346 * The BR tag can only exist as an opening tag. If something like `</br>`
3347 * appears then the HTML parser will treat it as an opening tag with no
3348 * attributes. The BR tag is unique in this way.
3349 *
3350 * @see https://html.spec.whatwg.org/#parsing-main-inbody
3351 */
3352 'BR' !== $this->get_tag()
3353 );
3354 }
3355
3356 /**
3357 * Indicates the kind of matched token, if any.
3358 *
3359 * This differs from `get_token_name()` in that it always
3360 * returns a static string indicating the type, whereas
3361 * `get_token_name()` may return values derived from the
3362 * token itself, such as a tag name or processing
3363 * instruction tag.
3364 *
3365 * Possible values:
3366 * - `#tag` when matched on a tag.
3367 * - `#text` when matched on a text node.
3368 * - `#cdata-section` when matched on a CDATA node.
3369 * - `#comment` when matched on a comment.
3370 * - `#doctype` when matched on a DOCTYPE declaration.
3371 * - `#presumptuous-tag` when matched on an empty tag closer.
3372 * - `#funky-comment` when matched on a funky comment.
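 *
 * Example (an illustrative sketch based on the parser states handled in this
 * method and in `get_token_name()`):
 *
 *     $p = new WP_HTML_Tag_Processor( '<!DOCTYPE html><p>Hi</p>' );
 *     $p->next_token() === true;
 *     $p->get_token_type() === '#doctype';
 *     $p->get_token_name() === 'html';
 *
 *     $p->next_token() === true;
 *     $p->get_token_type() === '#tag';
 *     $p->get_token_name() === 'P';
 *
 *     $p->next_token() === true;
 *     $p->get_token_type() === '#text';
 *     $p->get_token_name() === '#text';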
3373 *
3374 * @since 6.5.0
3375 *
3376 * @return string|null What kind of token is matched, or null.
3377 */
3378 public function get_token_type(): ?string {
3379 switch ( $this->parser_state ) {
3380 case self::STATE_MATCHED_TAG:
3381 return '#tag';
3382
3383 case self::STATE_DOCTYPE:
3384 return '#doctype';
3385
3386 default:
3387 return $this->get_token_name();
3388 }
3389 }
3390
3391 /**
3392 * Returns the node name represented by the token.
3393 *
3394 * This matches the DOM API value `nodeName`. Some values
3395 * are static, such as `#text` for a text node, while others
3396 * are dynamically generated from the token itself.
3397 *
3398 * Dynamic names:
3399 * - Uppercase tag name for tag matches.
3400 * - `html` for DOCTYPE declarations.
3401 *
3402 * Note that if the Tag Processor is not matched on a token
3403 * then this function will return `null`, either because it
3404 * hasn't yet found a token or because it reached the end
3405 * of the document without matching a token.
3406 *
3407 * @since 6.5.0
3408 *
3409 * @return string|null Name of the matched token.
3410 */
3411 public function get_token_name(): ?string {
3412 switch ( $this->parser_state ) {
3413 case self::STATE_MATCHED_TAG:
3414 return $this->get_tag();
3415
3416 case self::STATE_TEXT_NODE:
3417 return '#text';
3418
3419 case self::STATE_CDATA_NODE:
3420 return '#cdata-section';
3421
3422 case self::STATE_COMMENT:
3423 return '#comment';
3424
3425 case self::STATE_DOCTYPE:
3426 return 'html';
3427
3428 case self::STATE_PRESUMPTUOUS_TAG:
3429 return '#presumptuous-tag';
3430
3431 case self::STATE_FUNKY_COMMENT:
3432 return '#funky-comment';
3433 }
3434
3435 return null;
3436 }
3437
3438 /**
3439 * Indicates what kind of comment produced the comment node.
3440 *
3441 * Because there are different kinds of HTML syntax which produce
3442 * comments, the Tag Processor tracks and exposes this as a type
3443 * for the comment. Nominally only regular HTML comments exist as
3444 * they are commonly known, but a number of unrelated syntax errors
3445 * also produce comments.
3446 *
3447 * @see self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT
3448 * @see self::COMMENT_AS_CDATA_LOOKALIKE
3449 * @see self::COMMENT_AS_INVALID_HTML
3450 * @see self::COMMENT_AS_HTML_COMMENT
3451 * @see self::COMMENT_AS_PI_NODE_LOOKALIKE
3452 *
3453 * @since 6.5.0
3454 *
3455 * @return string|null Type of the matched comment, or `null` when not matched on a comment.
3456 */
3457 public function get_comment_type(): ?string {
3458 if ( self::STATE_COMMENT !== $this->parser_state ) {
3459 return null;
3460 }
3461
3462 return $this->comment_type;
3463 }
3464
3465 /**
3466 * Returns the text of a matched comment or null if not on a comment type node.
3467 *
3468 * This method returns the entire text content of a comment node as it
3469 * would appear in the browser.
3470 *
3471 * This differs from {@see ::get_modifiable_text()} in that certain comment
3472 * types in the HTML API cannot allow their entire comment text content to
3473 * be modified. Namely, "bogus comments" of the form `<?not allowed in html>`
3474 * will create a comment whose text content starts with `?`. Note that if
3475 * that character were modified, it would be possible to change the node
3476 * type.
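 *
 * Example (an illustrative sketch of how the full text can differ from the
 * modifiable text, based on the comment types handled in this method):
 *
 *     $p = new WP_HTML_Tag_Processor( '<!-- note --><![CDATA[data]]>' );
 *     $p->next_token() === true;
 *     $p->get_full_comment_text() === ' note ';
 *
 *     $p->next_token() === true;
 *     $p->get_comment_type() === WP_HTML_Tag_Processor::COMMENT_AS_CDATA_LOOKALIKE;
 *     $p->get_modifiable_text() === 'data';
 *     $p->get_full_comment_text() === '[CDATA[data]]';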
3477 *
3478 * @since 6.7.0
3479 *
3480 * @return string|null The comment text as it would appear in the browser or null
3481 * if not on a comment type node.
3482 */
3483 public function get_full_comment_text(): ?string {
3484 if ( self::STATE_FUNKY_COMMENT === $this->parser_state ) {
3485 return $this->get_modifiable_text();
3486 }
3487
3488 if ( self::STATE_COMMENT !== $this->parser_state ) {
3489 return null;
3490 }
3491
3492 switch ( $this->get_comment_type() ) {
3493 case self::COMMENT_AS_HTML_COMMENT:
3494 case self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT:
3495 return $this->get_modifiable_text();
3496
3497 case self::COMMENT_AS_CDATA_LOOKALIKE:
3498 return "[CDATA[{$this->get_modifiable_text()}]]";
3499
3500 case self::COMMENT_AS_PI_NODE_LOOKALIKE:
3501 return "?{$this->get_tag()}{$this->get_modifiable_text()}?";
3502
3503 /*
3504 * This represents the "bogus comment state" from HTML tokenization.
3505 * This can be entered by `<?` or `<!`, where `?` is included in
3506 * the comment text but `!` is not.
3507 */
3508 case self::COMMENT_AS_INVALID_HTML:
3509 $preceding_character = $this->html[ $this->text_starts_at - 1 ];
3510 $comment_start = '?' === $preceding_character ? '?' : '';
3511 return "{$comment_start}{$this->get_modifiable_text()}";
3512 }
3513
3514 return null;
3515 }
3516
3517 /**
3518 * Subdivides a matched text node, splitting leading NULL byte sequences and decoded
3519 * whitespace into distinct prefix nodes.
3520 *
3521 * Note that once anything that's neither a NULL byte nor decoded whitespace is
3522 * encountered, then the remainder of the text node is left intact as generic text.
3523 *
3524 * - The HTML Processor uses this to apply distinct rules for different kinds of text.
3525 * - Inter-element whitespace can be detected and skipped with this method.
3526 *
3527 * Text nodes aren't eagerly subdivided because there's no need to split them unless
3528 * decisions are being made on NULL byte sequences or whitespace-only text.
3529 *
3530 * Example:
3531 *
3532 * $processor = new WP_HTML_Tag_Processor( "\x00Apples & Oranges" );
3533 * true === $processor->next_token(); // Text is "Apples & Oranges".
3534 * true === $processor->subdivide_text_appropriately(); // Text is "".
3535 * true === $processor->next_token(); // Text is "Apples & Oranges".
3536 * false === $processor->subdivide_text_appropriately();
3537 *
3538 * $processor = new WP_HTML_Tag_Processor( "&#x0A; \r\n\tMore" );
3539 * true === $processor->next_token(); // Text is "␤ ␤␉More".
3540 * true === $processor->subdivide_text_appropriately(); // Text is "␤ ␤␉".
3541 * true === $processor->next_token(); // Text is "More".
3542 * false === $processor->subdivide_text_appropriately();
3543 *
3544 * @since 6.7.0
3545 *
3546 * @return bool Whether the text node was subdivided.
3547 */
3548 public function subdivide_text_appropriately(): bool {
3549 if ( self::STATE_TEXT_NODE !== $this->parser_state ) {
3550 return false;
3551 }
3552
3553 $this->text_node_classification = self::TEXT_IS_GENERIC;
3554
3555 /*
3556 * NULL bytes are treated categorically differently from numeric character
3557 * references whose number is zero. `&#x00;` is not the same as `"\x00"`.
3558 */
3559 $leading_nulls = strspn( $this->html, "\x00", $this->text_starts_at, $this->text_length );
3560 if ( $leading_nulls > 0 ) {
3561 $this->token_length = $leading_nulls;
3562 $this->text_length = $leading_nulls;
3563 $this->bytes_already_parsed = $this->token_starts_at + $leading_nulls;
3564 $this->text_node_classification = self::TEXT_IS_NULL_SEQUENCE;
3565 return true;
3566 }
3567
3568 /*
3569 * Start a decoding loop to determine the point at which the
3570 * text subdivides. This entails raw whitespace bytes and any
3571 * character reference that decodes to the same.
3572 */
3573 $at = $this->text_starts_at;
3574 $end = $this->text_starts_at + $this->text_length;
3575 while ( $at < $end ) {
3576 $skipped = strspn( $this->html, " \t\f\r\n", $at, $end - $at );
3577 $at += $skipped;
3578
3579 if ( $at < $end && '&' === $this->html[ $at ] ) {
3580 $matched_byte_length = null;
3581 $replacement = WP_HTML_Decoder::read_character_reference( 'data', $this->html, $at, $matched_byte_length );
3582 if ( isset( $replacement ) && 1 === strspn( $replacement, " \t\f\r\n" ) ) {
3583 $at += $matched_byte_length;
3584 continue;
3585 }
3586 }
3587
3588 break;
3589 }
3590
3591 if ( $at > $this->text_starts_at ) {
3592 $new_length = $at - $this->text_starts_at;
3593 $this->text_length = $new_length;
3594 $this->token_length = $new_length;
3595 $this->bytes_already_parsed = $at;
3596 $this->text_node_classification = self::TEXT_IS_WHITESPACE;
3597 return true;
3598 }
3599
3600 return false;
3601 }
3602
3603 /**
3604 * Returns the modifiable text for a matched token, or an empty string.
3605 *
3606 * Modifiable text is text content that may be read and changed without
3607 * changing the HTML structure of the document around it. This includes
3608 * the contents of `#text` nodes in the HTML as well as the inner
3609 * contents of HTML comments, Processing Instructions, and others, even
3610 * though these nodes aren't part of a parsed DOM tree. It also includes
3611 * the contents of SCRIPT and STYLE tags, of TEXTAREA tags, and of any
3612 * other section in an HTML document which cannot contain HTML markup (DATA).
3613 *
3614 * If a token has no modifiable text then an empty string is returned to
3615 * avoid needless crashing or type errors. An empty string does not mean
3616 * that a token has modifiable text, and a token with modifiable text may
3617 * have an empty string (e.g. a comment with no contents).
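 *
 * Example (an illustrative sketch of the decoding rules applied in this method):
 *
 *     $p = new WP_HTML_Tag_Processor( "<p>Eggs &amp; Milk\r\n</p><script>a &amp; b</script>" );
 *     $p->next_tag( 'P' ) === true;
 *     $p->next_token() === true; // Moves to the text node inside the P element.
 *     $p->get_modifiable_text() === "Eggs & Milk\n"; // Decoded, with newlines normalized.
 *
 *     $p->next_tag( 'SCRIPT' ) === true;
 *     $p->get_modifiable_text() === 'a &amp; b'; // Script data is not decoded.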
3618 *
3619 * Limitations:
3620 *
3621 * - This function will not strip the leading newline appropriately
3622 * after seeking into a LISTING or PRE element. To ensure that the
3623 * newline is treated properly, seek to the LISTING or PRE opening
3624 * tag instead of to the first text node inside the element.
3625 *
3626 * @since 6.5.0
3627 * @since 6.7.0 Replaces NULL bytes (U+0000) and newlines appropriately.
3628 *
3629 * @return string Modifiable text of the matched token, or an empty string if it has none.
3630 */
3631 public function get_modifiable_text(): string {
3632 $has_enqueued_update = isset( $this->lexical_updates['modifiable text'] );
3633
3634 if ( ! $has_enqueued_update && ( null === $this->text_starts_at || 0 === $this->text_length ) ) {
3635 return '';
3636 }
3637
3638 $text = $has_enqueued_update
3639 ? $this->lexical_updates['modifiable text']->text
3640 : substr( $this->html, $this->text_starts_at, $this->text_length );
3641
3642 /*
3643 * Pre-processing the input stream would normally happen before
3644 * any parsing is done, but deferring it means it's possible to
3645 * skip in most cases. When getting the modifiable text, however,
3646 * it's important to apply the pre-processing step of
3647 * normalizing newlines.
3648 *
3649 * @see https://html.spec.whatwg.org/#preprocessing-the-input-stream
3650 * @see https://infra.spec.whatwg.org/#normalize-newlines
3651 */
3652 $text = str_replace( "\r\n", "\n", $text );
3653 $text = str_replace( "\r", "\n", $text );
3654
3655 // Comment data is not decoded.
3656 if (
3657 self::STATE_CDATA_NODE === $this->parser_state ||
3658 self::STATE_COMMENT === $this->parser_state ||
3659 self::STATE_DOCTYPE === $this->parser_state ||
3660 self::STATE_FUNKY_COMMENT === $this->parser_state
3661 ) {
3662 return str_replace( "\x00", "\u{FFFD}", $text );
3663 }
3664
3665 $tag_name = $this->get_token_name();
3666 if (
3667 // Script data is not decoded.
3668 'SCRIPT' === $tag_name ||
3669
3670 // RAWTEXT data is not decoded.
3671 'IFRAME' === $tag_name ||
3672 'NOEMBED' === $tag_name ||
3673 'NOFRAMES' === $tag_name ||
3674 'STYLE' === $tag_name ||
3675 'XMP' === $tag_name
3676 ) {
3677 return str_replace( "\x00", "\u{FFFD}", $text );
3678 }
3679
3680 $decoded = WP_HTML_Decoder::decode_text_node( $text );
3681
3682 /*
3683 * Skip the first line feed after LISTING, PRE, and TEXTAREA opening tags.
3684 *
3685 * Note that this first newline may come in the form of a character
3686 * reference, such as `&#x0a;`, and so it's important to perform
3687 * this transformation only after decoding the raw text content.
3688 */
3689 if (
3690 ( "\n" === ( $decoded[0] ?? '' ) ) &&
3691 ( ( $this->skip_newline_at === $this->token_starts_at && '#text' === $tag_name ) || 'TEXTAREA' === $tag_name )
3692 ) {
3693 $decoded = substr( $decoded, 1 );
3694 }
3695
3696 /*
3697 * Only in normative text nodes does the NULL byte (U+0000) get removed.
3698 * In all other contexts it's replaced by the replacement character (U+FFFD)
3699 * for security reasons (to avoid joining together strings that were safe
3700 * when separated, but not when joined).
3701 *
3702 * @todo Inside HTML integration points and MathML integration points, the
3703 * text is processed according to the insertion mode, not according
3704 * to the foreign content rules. This should strip the NULL bytes.
3705 */
3706 return ( '#text' === $tag_name && 'html' === $this->get_namespace() )
3707 ? str_replace( "\x00", '', $decoded )
3708 : str_replace( "\x00", "\u{FFFD}", $decoded );
3709 }
3710
3711 /**
3712 * Sets the modifiable text for the matched token, if matched.
3713 *
3714 * Modifiable text is text content that may be read and changed without
3715 * changing the HTML structure of the document around it. This includes
3716 * the contents of `#text` nodes in the HTML as well as the inner
3717 * contents of HTML comments, Processing Instructions, and others, even
3718 * though these nodes aren't part of a parsed DOM tree. It also includes
3719 * the contents of SCRIPT and STYLE tags, of TEXTAREA tags, and of any
3720 * other section in an HTML document which cannot contain HTML markup (DATA).
3721 *
3722 * Not all modifiable text may be set by this method, and not all content
3723 * may be set as modifiable text. In the case that this fails it will return
3724 * `false` indicating as much. For instance, it will not allow inserting the
3725 * string `</script` into a SCRIPT element, because the rules for escaping
3726 * that safely are complicated. Similarly, it will not allow setting content
3727 * into a comment which would prematurely terminate the comment.
3728 *
3729 * Example:
3730 *
3731 * // Add a preface to all STYLE contents.
3732 * while ( $processor->next_tag( 'STYLE' ) ) {
3733 * $style = $processor->get_modifiable_text();
3734 * $processor->set_modifiable_text( "// Made with love on the World Wide Web\n{$style}" );
3735 * }
3736 *
3737 * // Replace smiley text with Emoji smilies.
3738 * while ( $processor->next_token() ) {
3739 * if ( '#text' !== $processor->get_token_name() ) {
3740 * continue;
3741 * }
3742 *
3743 * $chunk = $processor->get_modifiable_text();
3744 * if ( ! str_contains( $chunk, ':)' ) ) {
3745 * continue;
3746 * }
3747 *
3748 * $processor->set_modifiable_text( str_replace( ':)', '🙂', $chunk ) );
3749 * }
3750 *
3751 * This function handles all necessary HTML encoding. Provide normal, unescaped string values.
3752 * The HTML API will encode the strings appropriately so that the browser will interpret them
3753 * as the intended value.
3754 *
3755 * Example:
3756 *
3757 * // Renders as “Eggs & Milk” in a browser, encoded as `<p>Eggs &amp; Milk</p>`.
3758 * $processor->set_modifiable_text( 'Eggs & Milk' );
3759 *
3760 * // Renders as “Eggs &amp; Milk” in a browser, encoded as `<p>Eggs &amp;amp; Milk</p>`.
3761 * $processor->set_modifiable_text( 'Eggs &amp; Milk' );
3762 *
3763 * @since 6.7.0
3764 * @since 6.9.0 Escapes all character references instead of trying to avoid double-escaping.
3765 *
3766 * @param string $plaintext_content New text content to represent in the matched token.
3767 * @return bool Whether the text was able to update.
3768 */
3769 public function set_modifiable_text( string $plaintext_content ): bool {
3770 if ( self::STATE_TEXT_NODE === $this->parser_state ) {
3771 $this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
3772 $this->text_starts_at,
3773 $this->text_length,
3774 strtr(
3775 $plaintext_content,
3776 array(
3777 '<' => '&lt;',
3778 '>' => '&gt;',
3779 '&' => '&amp;',
3780 '"' => '&quot;',
3781 "'" => '&apos;',
3782 )
3783 )
3784 );
3785
3786 return true;
3787 }
3788
3789 // Comment data is not encoded.
3790 if (
3791 self::STATE_COMMENT === $this->parser_state &&
3792 self::COMMENT_AS_HTML_COMMENT === $this->comment_type
3793 ) {
3794 // Check if the text could close the comment.
3795 if ( 1 === preg_match( '/--!?>/', $plaintext_content ) ) {
3796 return false;
3797 }
3798
3799 $this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
3800 $this->text_starts_at,
3801 $this->text_length,
3802 $plaintext_content
3803 );
3804
3805 return true;
3806 }
3807
3808 if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
3809 return false;
3810 }
3811
3812 switch ( $this->get_tag() ) {
3813 case 'SCRIPT':
3814 /**
3815 * This is over-protective, but ensures the update doesn't break
3816 * the HTML structure of the SCRIPT element.
3817 *
3818 * More thorough analysis could track the HTML tokenizer states
3819 * to ensure that the SCRIPT element closes at the expected
3820 * SCRIPT close tag as is done in {@see ::skip_script_data()}.
3821 *
3822 * A SCRIPT element could be closed prematurely by contents
3823 * like `</script>`. A SCRIPT element could be prevented from
3824 * closing by contents like `<!--<script>`.
3825 *
3826 * The following strings are essential for dangerous content,
3827 * although they are insufficient on their own. This trade-off
3828 * prevents dangerous scripts from being sent to the browser.
3829 * It is also unlikely to produce HTML that may confuse more
3830 * basic HTML tooling.
3831 */
3832 if (
3833 false !== stripos( $plaintext_content, '</script' ) ||
3834 false !== stripos( $plaintext_content, '<script' )
3835 ) {
3836 return false;
3837 }
3838
3839 $this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
3840 $this->text_starts_at,
3841 $this->text_length,
3842 $plaintext_content
3843 );
3844
3845 return true;
3846
3847 case 'STYLE':
3848 $plaintext_content = preg_replace_callback(
3849 '~</(?P<TAG_NAME>style)~i',
3850 static function ( $tag_match ) {
3851 return "\\3c\\2f{$tag_match['TAG_NAME']}";
3852 },
3853 $plaintext_content
3854 );
3855
3856 $this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
3857 $this->text_starts_at,
3858 $this->text_length,
3859 $plaintext_content
3860 );
3861
3862 return true;
3863
3864 case 'TEXTAREA':
3865 case 'TITLE':
3866 $plaintext_content = preg_replace_callback(
3867 "~</(?P<TAG_NAME>{$this->get_tag()})~i",
3868 static function ( $tag_match ) {
3869 return "&lt;/{$tag_match['TAG_NAME']}";
3870 },
3871 $plaintext_content
3872 );
3873
3874 /*
3875 * These don't _need_ to be escaped, but since they are decoded it's
3876 * safe to leave them escaped and this can prevent other code from
3877 * naively detecting tags within the contents.
3878 *
3879 * @todo It would be useful to prefix a multiline replacement text
3880 * with a newline, but not necessary. This is for aesthetics.
3881 */
3882 $this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
3883 $this->text_starts_at,
3884 $this->text_length,
3885 $plaintext_content
3886 );
3887
3888 return true;
3889 }
3890
3891 return false;
3892 }
3893
3894 /**
3895 * Updates or creates a new attribute on the currently matched tag with the passed value.
3896 *
3897 * This function handles all necessary HTML encoding. Provide normal, unescaped string values.
3898 * The HTML API will encode the strings appropriately so that the browser will interpret them
3899 * as the intended value.
3900 *
3901 * Example:
3902 *
3903 * // Renders “Eggs & Milk” in a browser, encoded as `<abbr title="Eggs &amp; Milk">`.
3904 * $processor->set_attribute( 'title', 'Eggs & Milk' );
3905 *
3906 * // Renders “Eggs &amp; Milk” in a browser, encoded as `<abbr title="Eggs &amp;amp; Milk">`.
3907 * $processor->set_attribute( 'title', 'Eggs &amp; Milk' );
3908 *
3909 * // Renders `true` as `<abbr title>`.
3910 * $processor->set_attribute( 'title', true );
3911 *
3912 * // Renders without the attribute for `false` as `<abbr>`.
3913 * $processor->set_attribute( 'title', false );
3914 *
3915 * Special handling is provided for boolean attribute values:
3916 * - When `true` is passed as the value, then only the attribute name is added to the tag.
3917 * - When `false` is passed, the attribute gets removed if it existed before.
3918 *
3919 * @since 6.2.0
3920 * @since 6.2.1 Fix: Only create a single update for multiple calls with case-variant attribute names.
3921 * @since 6.9.0 Escapes all character references instead of trying to avoid double-escaping.
3922 *
3923 * @param string $name The attribute name to target.
3924 * @param string|bool $value The new attribute value.
3925 * @return bool Whether an attribute value was set.
3926 */
3927 public function set_attribute( $name, $value ): bool {
3928 if (
3929 self::STATE_MATCHED_TAG !== $this->parser_state ||
3930 $this->is_closing_tag
3931 ) {
3932 return false;
3933 }
3934
3935 $name_length = strlen( $name );
3936
3937 /**
3938 * WordPress rejects more characters than are strictly forbidden
3939 * in HTML5. This is to prevent additional security risks deeper
3940 * in the WordPress and plugin stack. Specifically the following
3941 * are not allowed to be set as part of an HTML attribute name:
3942 *
3943 * - greater-than “>”
3944 * - ampersand “&”
3945 *
3946 * @see https://html.spec.whatwg.org/#attributes-2
3947 */
3948 if (
3949 0 === $name_length ||
3950 // Syntax-like characters.
3951 strcspn( $name, '"\'>&</ =' ) !== $name_length ||
3952 // Control characters.
3953 strcspn(
3954 $name,
3955 "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0A\x0B\x0C\x0D\x0E\x0F" .
3956 "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F"
3957 ) !== $name_length ||
3958 // Unicode noncharacters.
3959 wp_has_noncharacters( $name )
3960 ) {
3961 _doing_it_wrong(
3962 __METHOD__,
3963 __( 'Invalid attribute name.' ),
3964 '6.2.0'
3965 );
3966
3967 return false;
3968 }
3969
3970 /*
3971 * > The values "true" and "false" are not allowed on boolean attributes.
3972 * > To represent a false value, the attribute has to be omitted altogether.
3973 * - HTML5 spec, https://html.spec.whatwg.org/#boolean-attributes
3974 */
3975 if ( false === $value ) {
3976 return $this->remove_attribute( $name );
3977 }
3978
3979 if ( true === $value ) {
3980 $updated_attribute = $name;
3981 } else {
3982 $comparable_name = strtolower( $name );
3983
3984 /**
3985 * Escape attribute values appropriately.
3986 *
3987 * @see https://html.spec.whatwg.org/#attributes-3
3988 */
3989 $escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true )
3990 ? esc_url( $value )
3991 : strtr(
3992 $value,
3993 array(
3994 '<' => '&lt;',
3995 '>' => '&gt;',
3996 '&' => '&amp;',
3997 '"' => '&quot;',
3998 "'" => '&apos;',
3999 )
4000 );
4001
4002 // If the escaping functions wiped out the update, reject it and indicate it was rejected.
4003 if ( '' === $escaped_new_value && '' !== $value ) {
4004 return false;
4005 }
4006
4007 $updated_attribute = "{$name}=\"{$escaped_new_value}\"";
4008 }
4009
4010 /*
4011 * > There must never be two or more attributes on
4012 * > the same start tag whose names are an ASCII
4013 * > case-insensitive match for each other.
4014 * - HTML 5 spec
4015 *
4016 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
4017 */
4018 $comparable_name = strtolower( $name );
4019
4020 if ( isset( $this->attributes[ $comparable_name ] ) ) {
4021 /*
4022 * Update an existing attribute.
4023 *
4024 * Example – set attribute id to "new" in <div id="initial_id" />:
4025 *
4026 * <div id="initial_id"/>
4027 * ^-------------^
4028 * start end
4029 * replacement: `id="new"`
4030 *
4031 * Result: <div id="new"/>
4032 */
4033 $existing_attribute = $this->attributes[ $comparable_name ];
4034 $this->lexical_updates[ $comparable_name ] = new WP_HTML_Text_Replacement(
4035 $existing_attribute->start,
4036 $existing_attribute->length,
4037 $updated_attribute
4038 );
4039 } else {
4040 /*
4041 * Create a new attribute at the tag's name end.
4042 *
4043 * Example – add attribute id="new" to <div />:
4044 *
4045 * <div/>
4046 * ^
4047 * start and end
4048 * replacement: ` id="new"`
4049 *
4050 * Result: <div id="new"/>
4051 */
4052 $this->lexical_updates[ $comparable_name ] = new WP_HTML_Text_Replacement(
4053 $this->tag_name_starts_at + $this->tag_name_length,
4054 0,
4055 ' ' . $updated_attribute
4056 );
4057 }
4058
4059 /*
4060 * Any calls to update the `class` attribute directly should wipe out any
4061 * enqueued class changes from `add_class` and `remove_class`.
4062 */
4063 if ( 'class' === $comparable_name && ! empty( $this->classname_updates ) ) {
4064 $this->classname_updates = array();
4065 }
4066
4067 return true;
4068 }
4069
4070 /**
4071 * Remove an attribute from the currently-matched tag.
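 *
 * Example (a minimal sketch; the markup and attribute names are illustrative only):
 *
 *     $processor = new WP_HTML_Tag_Processor( '<img src="cat.jpg" onclick="track()">' );
 *     if ( $processor->next_tag( 'img' ) ) {
 *         // Returns `true` because `onclick` exists on the matched tag.
 *         $processor->remove_attribute( 'onclick' );
 *
 *         // Returns `false` because `title` was not present in the input document.
 *         $processor->remove_attribute( 'title' );
 *     }
 *     $html = $processor->get_updated_html();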
4072 *
4073 * @since 6.2.0
4074 *
4075 * @param string $name The attribute name to remove.
4076 * @return bool Whether an attribute was removed.
4077 */
4078 public function remove_attribute( $name ): bool {
4079 if (
4080 self::STATE_MATCHED_TAG !== $this->parser_state ||
4081 $this->is_closing_tag
4082 ) {
4083 return false;
4084 }
4085
4086 /*
4087 * > There must never be two or more attributes on
4088 * > the same start tag whose names are an ASCII
4089 * > case-insensitive match for each other.
4090 * - HTML 5 spec
4091 *
4092 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
4093 */
4094 $name = strtolower( $name );
4095
4096 /*
4097 * Any calls to update the `class` attribute directly should wipe out any
4098 * enqueued class changes from `add_class` and `remove_class`.
4099 */
4100 if ( 'class' === $name && count( $this->classname_updates ) !== 0 ) {
4101 $this->classname_updates = array();
4102 }
4103
4104 /*
4105 * If updating an attribute that didn't exist in the input
4106 * document, then remove the enqueued update and move on.
4107 *
4108 * For example, this might occur when calling `remove_attribute()`
4109 * after calling `set_attribute()` for the same attribute
4110 * and when that attribute wasn't originally present.
4111 */
4112 if ( ! isset( $this->attributes[ $name ] ) ) {
4113 if ( isset( $this->lexical_updates[ $name ] ) ) {
4114 unset( $this->lexical_updates[ $name ] );
4115 }
4116 return false;
4117 }
4118
4119 /*
4120 * Removes an existing tag attribute.
4121 *
4122 * Example – remove the attribute id from <div id="initial_id"/>:
4123 * <div id="initial_id"/>
4124 * ^-------------^
4125 * start end
4126 * replacement: ``
4127 *
4128 * Result: <div />
4129 */
4130 $this->lexical_updates[ $name ] = new WP_HTML_Text_Replacement(
4131 $this->attributes[ $name ]->start,
4132 $this->attributes[ $name ]->length,
4133 ''
4134 );
4135
4136 // Removes any duplicated attributes if they were also present.
4137 foreach ( $this->duplicate_attributes[ $name ] ?? array() as $attribute_token ) {
4138 $this->lexical_updates[] = new WP_HTML_Text_Replacement(
4139 $attribute_token->start,
4140 $attribute_token->length,
4141 ''
4142 );
4143 }
4144
4145 return true;
4146 }
4147
4148 /**
4149 * Adds a new class name to the currently matched tag.
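 *
 * Example (a minimal sketch; the markup and class names are illustrative only):
 *
 *     $processor = new WP_HTML_Tag_Processor( '<div class="card">Hello</div>' );
 *     if ( $processor->next_tag( array( 'class_name' => 'card' ) ) ) {
 *         // The change is enqueued here and applied when the updated HTML is requested.
 *         $processor->add_class( 'is-highlighted' );
 *     }
 *     $html = $processor->get_updated_html();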
4150 *
4151 * @since 6.2.0
4152 *
4153 * @param string $class_name The class name to add.
4154 * @return bool Whether the class was set to be added.
4155 */
4156 public function add_class( $class_name ): bool {
4157 if (
4158 self::STATE_MATCHED_TAG !== $this->parser_state ||
4159 $this->is_closing_tag
4160 ) {
4161 return false;
4162 }
4163
4164 if ( self::QUIRKS_MODE !== $this->compat_mode ) {
4165 $this->classname_updates[ $class_name ] = self::ADD_CLASS;
4166 return true;
4167 }
4168
4169 /*
4170 * Because class names are matched ASCII-case-insensitively in quirks mode,
4171 * this needs to see if a case variant of the given class name is already
4172 * enqueued and update that existing entry, if so. This picks the casing of
4173 * the first-provided class name for all lexical variations.
4174 */
4175 $class_name_length = strlen( $class_name );
4176 foreach ( $this->classname_updates as $updated_name => $action ) {
4177 if (
4178 strlen( $updated_name ) === $class_name_length &&
4179 0 === substr_compare( $updated_name, $class_name, 0, $class_name_length, true )
4180 ) {
4181 $this->classname_updates[ $updated_name ] = self::ADD_CLASS;
4182 return true;
4183 }
4184 }
4185
4186 $this->classname_updates[ $class_name ] = self::ADD_CLASS;
4187 return true;
4188 }
4189
4190 /**
4191 * Removes a class name from the currently matched tag.
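 *
 * Example (a minimal sketch; the markup and class names are illustrative only):
 *
 *     $processor = new WP_HTML_Tag_Processor( '<p class="notice hidden">Saved.</p>' );
 *     if ( $processor->next_tag( 'p' ) ) {
 *         // The change is enqueued here and applied when the updated HTML is requested.
 *         $processor->remove_class( 'hidden' );
 *     }
 *     $html = $processor->get_updated_html();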
4192 *
4193 * @since 6.2.0
4194 *
4195 * @param string $class_name The class name to remove.
4196 * @return bool Whether the class was set to be removed.
4197 */
4198 public function remove_class( $class_name ): bool {
4199 if (
4200 self::STATE_MATCHED_TAG !== $this->parser_state ||
4201 $this->is_closing_tag
4202 ) {
4203 return false;
4204 }
4205
4206 if ( self::QUIRKS_MODE !== $this->compat_mode ) {
4207 $this->classname_updates[ $class_name ] = self::REMOVE_CLASS;
4208 return true;
4209 }
4210
4211 /*
4212 * Because class names are matched ASCII-case-insensitively in quirks mode,
4213 * this needs to see if a case variant of the given class name is already
4214 * enqueued and update that existing entry, if so. This picks the casing of
4215 * the first-provided class name for all lexical variations.
4216 */
4217 $class_name_length = strlen( $class_name );
4218 foreach ( $this->classname_updates as $updated_name => $action ) {
4219 if (
4220 strlen( $updated_name ) === $class_name_length &&
4221 0 === substr_compare( $updated_name, $class_name, 0, $class_name_length, true )
4222 ) {
4223 $this->classname_updates[ $updated_name ] = self::REMOVE_CLASS;
4224 return true;
4225 }
4226 }
4227
4228 $this->classname_updates[ $class_name ] = self::REMOVE_CLASS;
4229 return true;
4230 }
4231
4232 /**
4233 * Returns the string representation of the HTML Tag Processor.
4234 *
4235 * @since 6.2.0
4236 *
4237 * @see WP_HTML_Tag_Processor::get_updated_html()
4238 *
4239 * @return string The processed HTML.
4240 */
4241 public function __toString(): string {
4242 return $this->get_updated_html();
4243 }
4244
4245 /**
4246 * Returns the string representation of the HTML Tag Processor.
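 *
 * Example (a minimal sketch; the markup and attribute values are illustrative only):
 *
 *     $processor = new WP_HTML_Tag_Processor( '<a href="/about">About</a>' );
 *     if ( $processor->next_tag( 'a' ) ) {
 *         $processor->set_attribute( 'rel', 'nofollow' );
 *
 *         // Applies the enqueued change and returns the modified document.
 *         $html = $processor->get_updated_html();
 *
 *         // The processor remains at the same tag, so later changes still apply to it.
 *         $processor->add_class( 'external' );
 *         $html = $processor->get_updated_html();
 *     }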
4247 *
4248 * @since 6.2.0
4249 * @since 6.2.1 Shifts the internal cursor corresponding to the applied updates.
4250 * @since 6.4.0 No longer calls subclass method `next_tag()` after updating HTML.
4251 *
4252 * @return string The processed HTML.
4253 */
4254 public function get_updated_html(): string {
4255 $requires_no_updating = 0 === count( $this->classname_updates ) && 0 === count( $this->lexical_updates );
4256
4257 /*
4258 * When there is nothing more to update and nothing has already been
4259 * updated, return the original document and avoid a string copy.
4260 */
4261 if ( $requires_no_updating ) {
4262 return $this->html;
4263 }
4264
4265 /*
4266 * Keep track of the position right before the current tag. This will
4267 * be necessary for reparsing the current tag after updating the HTML.
4268 */
4269 $before_current_tag = $this->token_starts_at ?? 0;
4270
4271 /*
4272 * 1. Apply the enqueued edits and update all the pointers to reflect those changes.
4273 */
4274 $this->class_name_updates_to_attributes_updates();
4275 $before_current_tag += $this->apply_attributes_updates( $before_current_tag );
4276
4277 /*
4278 * 2. Rewind to before the current tag and reparse to get updated attributes.
4279 *
4280 * At this point the internal cursor points to the end of the tag name.
4281 * Rewind before the tag name starts so that it's as if the cursor didn't
4282 * move; a call to `next_tag()` will reparse the recently-updated attributes
4283 * and additional calls to modify the attributes will apply at this same
4284 * location, but in order to avoid issues with subclasses that might add
4285 * behaviors to `next_tag()`, the internal methods should be called here
4286 * instead.
4287 *
4288 * It's important to note that in this specific place there will be no change
4289 * because the processor was already at a tag when this was called and it's
4290 * rewinding only to the beginning of this very tag before reprocessing it
4291 * and its attributes.
4292 *
4293 * <p>Previous HTML<em>More HTML</em></p>
4294 * ↑ │ back up by the length of the tag name plus the opening <
4295 * └←─┘ back up by strlen("em") + 1 ==> 3
4296 */
4297 $this->bytes_already_parsed = $before_current_tag;
4298 $this->base_class_next_token();
4299
4300 return $this->html;
4301 }
4302
4303 /**
4304 * Parses tag query input into internal search criteria.
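 *
 * Example queries, as they would reach this method through `next_tag()`
 * (a sketch; the tag and class names are illustrative only):
 *
 *     // A string finds the next tag of that name.
 *     $processor->next_tag( 'img' );
 *
 *     // An array combines criteria: the third DIV having the `wp-block` class.
 *     $processor->next_tag(
 *         array(
 *             'tag_name'     => 'div',
 *             'class_name'   => 'wp-block',
 *             'match_offset' => 3,
 *         )
 *     );
 *
 *     // Stop on tag closers such as `</div>` as well as openers.
 *     $processor->next_tag(
 *         array(
 *             'tag_name'    => 'div',
 *             'tag_closers' => 'visit',
 *         )
 *     );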
4305 *
4306 * @since 6.2.0
4307 *
4308 * @param array|string|null $query {
4309 * Optional. Which tag name to find, having which class, etc. Default is to find any tag.
4310 *
4311 * @type string|null $tag_name Which tag to find, or `null` for "any tag."
4312 * @type int|null $match_offset Find the Nth tag matching all search criteria.
4313 * 1 for "first" tag, 3 for "third," etc.
4314 * Defaults to first tag.
4315 * @type string|null $class_name Tag must contain this class name to match.
4316 * @type string $tag_closers "visit" or "skip": whether to stop on tag closers, e.g. </div>.
4317 * }
4318 */
4319 private function parse_query( $query ) {
4320 if ( null !== $query && $query === $this->last_query ) {
4321 return;
4322 }
4323
4324 $this->last_query = $query;
4325 $this->sought_tag_name = null;
4326 $this->sought_class_name = null;
4327 $this->sought_match_offset = 1;
4328 $this->stop_on_tag_closers = false;
4329
4330 // A single string value means "find the tag of this name".
4331 if ( is_string( $query ) ) {
4332 $this->sought_tag_name = $query;
4333 return;
4334 }
4335
4336 // An empty query parameter applies no restrictions on the search.
4337 if ( null === $query ) {
4338 return;
4339 }
4340
4341 // If not using the string interface, an associative array is required.
4342 if ( ! is_array( $query ) ) {
4343 _doing_it_wrong(
4344 __METHOD__,
4345 __( 'The query argument must be an array or a tag name.' ),
4346 '6.2.0'
4347 );
4348 return;
4349 }
4350
4351 if ( isset( $query['tag_name'] ) && is_string( $query['tag_name'] ) ) {
4352 $this->sought_tag_name = $query['tag_name'];
4353 }
4354
4355 if ( isset( $query['class_name'] ) && is_string( $query['class_name'] ) ) {
4356 $this->sought_class_name = $query['class_name'];
4357 }
4358
4359 if ( isset( $query['match_offset'] ) && is_int( $query['match_offset'] ) && 0 < $query['match_offset'] ) {
4360 $this->sought_match_offset = $query['match_offset'];
4361 }
4362
4363 if ( isset( $query['tag_closers'] ) ) {
4364 $this->stop_on_tag_closers = 'visit' === $query['tag_closers'];
4365 }
4366 }
4367
4368
4369 /**
4370 * Checks whether a given tag and its attributes match the search criteria.
4371 *
4372 * @since 6.2.0
4373 *
4374 * @return bool Whether the given tag and its attributes match the search criteria.
4375 */
4376 private function matches(): bool {
4377 if ( $this->is_closing_tag && ! $this->stop_on_tag_closers ) {
4378 return false;
4379 }
4380
4381 // Does the tag name match the requested tag name in a case-insensitive manner?
4382 if (
4383 isset( $this->sought_tag_name ) &&
4384 (
4385 strlen( $this->sought_tag_name ) !== $this->tag_name_length ||
4386 0 !== substr_compare( $this->html, $this->sought_tag_name, $this->tag_name_starts_at, $this->tag_name_length, true )
4387 )
4388 ) {
4389 return false;
4390 }
4391
4392 if ( null !== $this->sought_class_name && ! $this->has_class( $this->sought_class_name ) ) {
4393 return false;
4394 }
4395
4396 return true;
4397 }
4398
4399 /**
4400 * Gets DOCTYPE declaration info from a DOCTYPE token.
4401 *
4402 * DOCTYPE tokens may appear in many places in an HTML document. In most places, they are
4403 * simply ignored. The main parsing functions find the basic shape of DOCTYPE tokens but
4404 * do not perform detailed parsing.
4405 *
4406 * This method can be called to perform a full parse of the DOCTYPE token and retrieve
4407 * its information.
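 *
 * Example (a minimal sketch, assuming `$html` holds a full document and that
 * `WP_HTML_Doctype_Info` exposes the DOCTYPE name through its `name` property):
 *
 *     $processor = new WP_HTML_Tag_Processor( $html );
 *     while ( $processor->next_token() ) {
 *         $doctype = $processor->get_doctype_info();
 *         if ( null === $doctype ) {
 *             // Not currently at a DOCTYPE token.
 *             continue;
 *         }
 *
 *         $doctype_name = $doctype->name;
 *         break;
 *     }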
4408 *
4409 * @return WP_HTML_Doctype_Info|null The DOCTYPE declaration information or `null` if not
4410 * currently at a DOCTYPE node.
4411 */
4412 public function get_doctype_info(): ?WP_HTML_Doctype_Info {
4413 if ( self::STATE_DOCTYPE !== $this->parser_state ) {
4414 return null;
4415 }
4416
4417 return WP_HTML_Doctype_Info::from_doctype_token( substr( $this->html, $this->token_starts_at, $this->token_length ) );
4418 }
4419
4420 /**
4421 * Parser Ready State.
4422 *
4423 * Indicates that the parser is ready to run and waiting for a state transition.
4424 * It may not have started yet, or it may have just finished parsing a token and
4425 * is ready to find the next one.
4426 *
4427 * @since 6.5.0
4428 *
4429 * @access private
4430 */
4431 const STATE_READY = 'STATE_READY';
4432
4433 /**
4434 * Parser Complete State.
4435 *
4436 * Indicates that the parser has reached the end of the document and there is
4437 * nothing left to scan. It finished parsing the last token completely.
4438 *
4439 * @since 6.5.0
4440 *
4441 * @access private
4442 */
4443 const STATE_COMPLETE = 'STATE_COMPLETE';
4444
4445 /**
4446 * Parser Incomplete Input State.
4447 *
4448 * Indicates that the parser has reached the end of the document before finishing
4449 * a token. It started parsing a token but there is a possibility that the input
4450 * HTML document was truncated in the middle of a token.
4451 *
4452 * The parser is reset at the start of the incomplete token and has paused. There
4453 * is nothing more that can be scanned unless provided a more complete document.
4454 *
4455 * @since 6.5.0
4456 *
4457 * @access private
4458 */
4459 const STATE_INCOMPLETE_INPUT = 'STATE_INCOMPLETE_INPUT';
4460
4461 /**
4462 * Parser Matched Tag State.
4463 *
4464 * Indicates that the parser has found an HTML tag and it's possible to get
4465 * the tag name and read or modify its attributes (if it's not a closing tag).
4466 *
4467 * @since 6.5.0
4468 *
4469 * @access private
4470 */
4471 const STATE_MATCHED_TAG = 'STATE_MATCHED_TAG';
4472
4473 /**
4474 * Parser Text Node State.
4475 *
4476 * Indicates that the parser has found a text node and it's possible
4477 * to read and modify that text.
4478 *
4479 * @since 6.5.0
4480 *
4481 * @access private
4482 */
4483 const STATE_TEXT_NODE = 'STATE_TEXT_NODE';
4484
4485 /**
4486 * Parser CDATA Node State.
4487 *
4488 * Indicates that the parser has found a CDATA node and it's possible
4489 * to read and modify its modifiable text. Note that in HTML there are
4490 * no CDATA nodes outside of foreign content (SVG and MathML). Outside
4491 * of foreign content, they are treated as HTML comments.
4492 *
4493 * @since 6.5.0
4494 *
4495 * @access private
4496 */
4497 const STATE_CDATA_NODE = 'STATE_CDATA_NODE';
4498
4499 /**
4500 * Indicates that the parser has found an HTML comment and it's
4501 * possible to read and modify its modifiable text.
4502 *
4503 * @since 6.5.0
4504 *
4505 * @access private
4506 */
4507 const STATE_COMMENT = 'STATE_COMMENT';
4508
4509 /**
4510 * Indicates that the parser has found a DOCTYPE node and it's
4511 * possible to read its DOCTYPE information via `get_doctype_info()`.
4512 *
4513 * @since 6.5.0
4514 *
4515 * @access private
4516 */
4517 const STATE_DOCTYPE = 'STATE_DOCTYPE';
4518
4519 /**
4520 * Indicates that the parser has found an empty tag closer `</>`.
4521 *
4522 * Note that in HTML there are no empty tag closers, and they
4523 * are ignored. Nonetheless, the Tag Processor still
4524 * recognizes them as they appear in the HTML stream.
4525 *
4526 * These were historically discussed as a "presumptuous tag
4527 * closer," which would close the nearest open tag, but were
4528 * dismissed in favor of explicitly-closing tags.
4529 *
4530 * @since 6.5.0
4531 *
4532 * @access private
4533 */
4534 const STATE_PRESUMPTUOUS_TAG = 'STATE_PRESUMPTUOUS_TAG';
4535
4536 /**
4537 * Indicates that the parser has found a "funky comment"
4538 * and it's possible to read and modify its modifiable text.
4539 *
4540 * Example:
4541 *
4542 * </%url>
4543 * </{"wp-bit":"query/post-author"}>
4544 * </2>
4545 *
4546 * Funky comments are tag closers with invalid tag names. Note
4547 * that in HTML these are turned into bogus comments. Nonetheless,
4548 * the Tag Processor recognizes them in a stream of HTML and
4549 * exposes them for inspection and modification.
4550 *
4551 * @since 6.5.0
4552 *
4553 * @access private
4554 */
4555 const STATE_FUNKY_COMMENT = 'STATE_WP_FUNKY';
4556
4557 /**
4558 * Indicates that a comment was created when encountering an abruptly-closed HTML comment.
4559 *
4560 * Example:
4561 *
4562 * <!-->
4563 * <!--->
4564 *
4565 * @since 6.5.0
4566 */
4567 const COMMENT_AS_ABRUPTLY_CLOSED_COMMENT = 'COMMENT_AS_ABRUPTLY_CLOSED_COMMENT';
4568
4569 /**
4570 * Indicates that a comment would be parsed as a CDATA node,
4571 * were HTML to allow CDATA nodes outside of foreign content.
4572 *
4573 * Example:
4574 *
4575 * <![CDATA[This is a CDATA node.]]>
4576 *
4577 * This is an HTML comment, but it looks like a CDATA node.
4578 *
4579 * @since 6.5.0
4580 */
4581 const COMMENT_AS_CDATA_LOOKALIKE = 'COMMENT_AS_CDATA_LOOKALIKE';
4582
4583 /**
4584 * Indicates that a comment was created when encountering
4585 * normative HTML comment syntax.
4586 *
4587 * Example:
4588 *
4589 * <!-- this is a comment -->
4590 *
4591 * @since 6.5.0
4592 */
4593 const COMMENT_AS_HTML_COMMENT = 'COMMENT_AS_HTML_COMMENT';
4594
4595 /**
4596 * Indicates that a comment would be parsed as a Processing
4597 * Instruction node, were they to exist within HTML.
4598 *
4599 * Example:
4600 *
4601 * <?wp __( 'Like' ) ?>
4602 *
4603 * This is an HTML comment, but it looks like a Processing Instruction node.
4604 *
4605 * @since 6.5.0
4606 */
4607 const COMMENT_AS_PI_NODE_LOOKALIKE = 'COMMENT_AS_PI_NODE_LOOKALIKE';
4608
4609 /**
4610 * Indicates that a comment was created when encountering invalid
4611 * HTML input, a so-called "bogus comment."
4612 *
4613 * Example:
4614 *
4615 * <?nothing special>
4616 * <!{nothing special}>
4617 *
4618 * @since 6.5.0
4619 */
4620 const COMMENT_AS_INVALID_HTML = 'COMMENT_AS_INVALID_HTML';
4621
4622 /**
4623 * No-quirks mode document compatibility mode.
4624 *
4625 * > In no-quirks mode, the behavior is (hopefully) the desired behavior
4626 * > described by the modern HTML and CSS specifications.
4627 *
4628 * @see self::$compat_mode
4629 * @see https://developer.mozilla.org/en-US/docs/Web/HTML/Quirks_Mode_and_Standards_Mode
4630 *
4631 * @since 6.7.0
4632 *
4633 * @var string
4634 */
4635 const NO_QUIRKS_MODE = 'no-quirks-mode';
4636
4637 /**
4638 * Quirks mode document compatibility mode.
4639 *
4640 * > In quirks mode, layout emulates behavior in Navigator 4 and Internet
4641 * > Explorer 5. This is essential in order to support websites that were
4642 * > built before the widespread adoption of web standards.
4643 *
4644 * @see self::$compat_mode
4645 * @see https://developer.mozilla.org/en-US/docs/Web/HTML/Quirks_Mode_and_Standards_Mode
4646 *
4647 * @since 6.7.0
4648 *
4649 * @var string
4650 */
4651 const QUIRKS_MODE = 'quirks-mode';
4652
4653 /**
4654 * Indicates that a span of text may contain any combination of significant
4655 * kinds of characters: NULL bytes, whitespace, and others.
4656 *
4657 * @see self::$text_node_classification
4658 * @see self::subdivide_text_appropriately
4659 *
4660 * @since 6.7.0
4661 */
4662 const TEXT_IS_GENERIC = 'TEXT_IS_GENERIC';
4663
4664 /**
4665 * Indicates that a span of text comprises a sequence only of NULL bytes.
4666 *
4667 * @see self::$text_node_classification
4668 * @see self::subdivide_text_appropriately
4669 *
4670 * @since 6.7.0
4671 */
4672 const TEXT_IS_NULL_SEQUENCE = 'TEXT_IS_NULL_SEQUENCE';
4673
4674 /**
4675 * Indicates that a span of decoded text comprises only whitespace.
4676 *
4677 * @see self::$text_node_classification
4678 * @see self::subdivide_text_appropriately
4679 *
4680 * @since 6.7.0
4681 */
4682 const TEXT_IS_WHITESPACE = 'TEXT_IS_WHITESPACE';
4683
4684 /**
4685 * Wakeup magic method.
4686 *
4687 * @since 6.9.2
4688 */
4689 public function __wakeup() {
4690 throw new \LogicException( __CLASS__ . ' should never be unserialized' );
4691 }
4692}
4693