1<?php
2/**
 * Efficiently scan through block structure in a document without parsing
4 * the entire block tree and all of its JSON attributes into memory.
5 *
6 * @package WordPress
7 * @subpackage Blocks
8 * @since 6.9.0
9 */
10
11/**
12 * Class for efficiently scanning through block structure in a document
13 * without parsing the entire block tree and JSON attributes into memory.
14 *
15 * ## Overview
16 *
17 * This class is designed to help analyze and modify block structure in a
18 * streaming fashion and to bridge the gap between parsed block trees and
19 * the text representing them.
20 *
21 * Use-cases for this class include but are not limited to:
22 *
23 * - Counting block types in a document.
24 * - Queuing stylesheets based on the presence of various block types.
 * - Modifying blocks of a given type, e.g. migrations, updates, and styling.
26 * - Searching for content of specific kinds, e.g. checking for blocks
27 * with certain theme support attributes, or block bindings.
28 * - Adding CSS class names to the element wrapping a block’s inner blocks.
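 *
 * As a minimal sketch of the first use-case, and assuming `$post_content`
 * holds the document text, block types might be counted like this:
 *
 *     $counts    = array();
 *     $processor = new WP_Block_Processor( $post_content );
 *     while ( $processor->next_block() ) {
 *         $block_type            = $processor->get_block_type();
 *         $counts[ $block_type ] = ( $counts[ $block_type ] ?? 0 ) + 1;
 *     }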
29 *
30 * > *Note!* If a fully-parsed block tree of a document is necessary, including
31 * > all the parsed JSON attributes, nested blocks, and HTML, consider
 * > using {@see \parse_blocks()} instead, which will parse the document
33 * > in one swift pass.
34 *
35 * For typical usage, jump first to the methods {@see self::next_block()},
36 * {@see self::next_delimiter()}, or {@see self::next_token()}.
37 *
38 * ### Values
39 *
 * As a lower-level interface than {@see \parse_blocks()}, this class follows
41 * different performance-focused values:
42 *
43 * - Minimize allocations so that documents of any size may be processed
44 * on a fixed or marginal amount of memory.
45 * - Make hidden costs explicit so that calling code only has to pay the
46 * performance penalty for features it needs.
47 * - Operate with a streaming and re-entrant design to make it possible
48 * to operate on chunks of a document and to resume after pausing.
49 *
50 * This means that some operations might appear more cumbersome than one
51 * might expect. This design tradeoff opens up opportunity to wrap this in
52 * a convenience class to add higher-level functionality.
53 *
54 * ## Concepts
55 *
56 * All text documents can be considered a block document containing a combination
57 * of “freeform HTML” and explicit block structure. Block structure forms through
58 * special HTML comments called _delimiters_ which include a block type and,
59 * optionally, block attributes encoded as a JSON object payload.
60 *
61 * This processor is designed to scan through a block document from delimiter to
62 * delimiter, tracking how the delimiters impact the structure of the document.
63 * Spans of HTML appear between delimiters. If these spans exist at the top level
64 * of the document, meaning there is no containing block around them, they are
65 * considered freeform HTML content. If, however, they appear _inside_ block
66 * structure they are interpreted as `innerHTML` for the containing block.
67 *
68 * ### Tokens and scanning
69 *
 * As the processor scans through a document it reports information about the token
 * on which it pauses. Tokens represent spans of text in the input comprising block
72 * delimiters and spans of HTML.
73 *
74 * - {@see self::next_token()} visits every contiguous subspan of text in the
75 * input document. This includes all explicit block comment delimiters and spans
76 * of HTML content (whether freeform or inner HTML).
77 * - {@see self::next_delimiter()} visits every explicit block comment delimiter
78 * unless passed a block type which covers freeform HTML content. In these cases
79 * it will stop at top-level spans of HTML and report a `null` block type.
80 * - {@see self::next_block()} visits every block delimiter which _opens_ a block.
81 * This includes opening block delimiters as well as void block delimiters. With
82 * the same exception as above for freeform HTML block types, this will visit
83 * top-level spans of HTML content.
84 *
85 * When matched on a particular token, the following methods provide structural
86 * and textual information about it:
87 *
88 * - {@see self::get_delimiter_type()} reports whether the delimiter is an opener,
89 * a closer, or if it represents a whole void block.
90 * - {@see self::get_block_type()} reports the fully-qualified block type which
91 * the delimiter represents.
92 * - {@see self::get_printable_block_type()} reports the fully-qualified block type,
93 * but returns `core/freeform` instead of `null` for top-level freeform HTML content.
94 * - {@see self::is_block_type()} indicates if the delimiter represents a block of
95 * the given block type, or wildcard or pseudo-block type described below.
96 * - {@see self::opens_block()} indicates if the delimiter opens a block of one
97 * of the provided block types. Opening, void, and top-level freeform HTML content
98 * all open blocks.
99 * - {@see static::get_attributes()} is currently reserved for a future streaming
100 * JSON parser class.
 * - {@see self::allocate_and_return_parsed_attributes()} extracts the JSON attributes
 * for delimiters which open blocks and returns the fully-parsed attributes as an
 * associative array. See {@see static::get_last_json_error()} for when this fails.
104 * - {@see self::is_html()} indicates if the token is a span of HTML which might
105 * be top-level freeform content or a block’s inner HTML.
106 * - {@see self::get_html_content()} returns the span of HTML.
107 * - {@see self::get_span()} for the byte offset and length into the input document
108 * representing the token.
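 *
 * For illustration only, a loop over every token might combine these methods
 * as follows. The `$html` input is an assumed string, and the comments describe
 * intent rather than guaranteed behavior:
 *
 *     $processor = new WP_Block_Processor( $html );
 *     while ( $processor->next_token() ) {
 *         if ( $processor->is_html() ) {
 *             // Freeform content or a block’s inner HTML.
 *             $html_chunk = $processor->get_html_content();
 *             continue;
 *         }
 *
 *         if ( $processor->opens_block( 'image' ) ) {
 *             // Parsing attributes allocates, so only do it for blocks of interest.
 *             $attributes = $processor->allocate_and_return_parsed_attributes();
 *         }
 *     }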
109 *
110 * It’s possible for the processor to fail to scan forward if the input document ends
111 * in a proper prefix of an explicit block comment delimiter. For example, if the input
112 * ends in `<!-- wp:` then it _might_ be the start of another delimiter. The parser
113 * cannot know, however, and therefore refuses to proceed. {@see static::get_last_error()}
114 * to distinguish between a failure to find the next token and an incomplete input.
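 *
 * A sketch of that check, assuming only that {@see static::get_last_error()}
 * returns `null` when nothing went wrong, and that `$chunk` is a hypothetical
 * piece of a larger document:
 *
 *     $processor = new WP_Block_Processor( $chunk );
 *     while ( $processor->next_token() ) {
 *         // Process each complete token.
 *     }
 *
 *     if ( null !== $processor->get_last_error() ) {
 *         // Scanning stopped for a reason other than reaching the end, for
 *         // example because `$chunk` ends inside a possible delimiter.
 *     }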
115 *
116 * ### Block types
117 *
118 * A block’s “type” comprises an optional _namespace_ and _name_. If the namespace
119 * isn’t provided it will be interpreted as the implicit `core` namespace. For example,
120 * the type `gallery` is the name of the block in the `core` namespace, but the type
121 * `abc/gallery` is the _fully-qualified_ block type for the block whose name is still
122 * `gallery`, but in the `abc` namespace.
123 *
124 * Methods on this class are aware of this block naming semantic and anywhere a block
125 * type is an argument to a method it will be normalized to account for implicit namespaces.
 * Passing `paragraph` is the same as passing `core/paragraph`. Conversely, anywhere
127 * this class returns a block type, it will return the fully-qualified and normalized form.
128 * For example, for the `<!-- wp:group -->` delimiter it will return `core/group` as the
129 * block type.
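 *
 * For example, in this small sketch both spellings refer to the same block type:
 *
 *     $processor = new WP_Block_Processor( '<!-- wp:group --><!-- /wp:group -->' );
 *     $processor->next_block( 'group' );
 *     $processor->is_block_type( 'core/group' ); // Expected to be true.
 *     $processor->get_block_type();              // Reported as 'core/group'.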
130 *
131 * There are two special block types that change the behavior of the processor:
132 *
133 * - The wildcard `*` represents _any block_. In addition to matching all block types,
134 * it also represents top-level freeform HTML whose block type is reported as `null`.
135 *
136 * - The `core/freeform` block type is a pseudo-block type which explicitly matches
137 * top-level freeform HTML.
138 *
139 * These special block types can be passed into any method which searches for blocks.
140 *
141 * There is one additional special block type which may be returned from
142 * {@see self::get_printable_block_type()}. This is the `#innerHTML` type, which
143 * indicates that the HTML span on which the processor is paused is inner HTML for
144 * a containing block.
145 *
146 * ### Spans of HTML
147 *
148 * Non-block content plays a complicated role in processing block documents. This
149 * processor exposes tools to help work with these spans of HTML.
150 *
151 * - {@see self::is_html()} indicates if the processor is paused at a span of
152 * HTML but does not differentiate between top-level freeform content and inner HTML.
153 * - {@see self::is_non_whitespace_html()} indicates not only if the processor
154 * is paused at a span of HTML, but also whether that span incorporates more than
155 * whitespace characters. Because block serialization often inserts newlines between
156 * block comment delimiters, this is useful for distinguishing “real” freeform
157 * content from purely aesthetic syntax.
158 * - {@see self::is_block_type()} matches top-level freeform HTML content when
159 * provided one of the special block types described above.
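 *
 * As a sketch, and assuming `$post_content` holds the document, these methods
 * can answer whether a post contains meaningful freeform content:
 *
 *     $has_freeform = false;
 *     $processor    = new WP_Block_Processor( $post_content );
 *     while ( $processor->next_block( 'core/freeform' ) ) {
 *         if ( $processor->is_non_whitespace_html() ) {
 *             $has_freeform = true;
 *             break;
 *         }
 *     }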
160 *
161 * ### Block structure
162 *
163 * As the processor traverses block delimiters it maintains a stack of which blocks are
164 * open at the given place in the document where it’s paused. This stack represents the
165 * block structure of a document and is used to determine where blocks end, which blocks
166 * represent inner blocks, whether a span of HTML is top-level freeform content, and
167 * more. Investigate the stack with {@see self::get_breadcrumbs()}, which returns an
168 * array of block types starting at the outermost-open block and descending to the
169 * currently-visited block.
170 *
 * Unlike {@see \parse_blocks()}, spans of HTML appear in this structure as the special
172 * reported block type `#html`. Such a span represents inner HTML for a block if the
173 * depth reported by {@see self::get_depth()} is greater than one.
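 *
 * As an illustrative sketch, with `$post_content` assumed to hold the document
 * and breadcrumbs assumed to be reported as fully-qualified block types, the
 * stack can tell whether a visited block sits inside another block:
 *
 *     $processor = new WP_Block_Processor( $post_content );
 *     while ( $processor->next_block( 'image' ) ) {
 *         $breadcrumbs = $processor->get_breadcrumbs();
 *         if ( in_array( 'core/gallery', $breadcrumbs, true ) ) {
 *             // This image block is nested somewhere within a gallery block.
 *         }
 *     }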
174 *
175 * It will generally not be necessary to inspect the stack of open blocks, though
176 * depth may be important for finding where blocks end. When visiting a block opener,
177 * the depth will have been increased before pausing; in contrast the depth is
178 * decremented before visiting a closer. This makes the following an easy way to
179 * determine if a block is still open.
180 *
181 * Example:
182 *
 *     $depth = $processor->get_depth();
 *     while ( $processor->next_token() && $processor->get_depth() > $depth ) {
 *         continue;
 *     }
 *     // Processor is now paused at the token immediately following the closed block.
188 *
189 * #### Extracting blocks
190 *
191 * A unique feature of this processor is the ability to return the same output as
192 * {@see \parse_blocks()} would produce, but for a subset of the input document.
193 * For example, it’s possible to extract an image block, manipulate that parsed
194 * block, and re-serialize it into the original document. It’s possible to do so
195 * while skipping over the parse of the rest of the document.
196 *
197 * {@see self::extract_full_block_and_advance()} will scan forward from the current block opener
198 * and build the parsed block structure until the current block is closed. It will
199 * include all inner HTML and inner blocks, and parse all of the inner blocks. It
200 * can be used to extract a block at any depth in the document, helpful for operating
201 * on blocks within nested structure.
202 *
203 * Example:
204 *
 *     if ( ! $processor->next_block( 'gallery' ) ) {
 *         return $post_content;
 *     }
 *
 *     $gallery_at    = $processor->get_span()->start;
 *     $gallery_block = $processor->extract_full_block_and_advance();
 *     $after_gallery = $processor->get_span()->start;
 *     return (
 *         substr( $post_content, 0, $gallery_at ) .
 *         serialize_block( modify_gallery( $gallery_block ) ) .
 *         substr( $post_content, $after_gallery )
 *     );
217 *
218 * #### Handling of malformed structure
219 *
220 * There are situations where closing block delimiters appear for which no open block
221 * exists, or where a document ends before a block is closed, or where a closing block
222 * delimiter appears but references a different block type than the most-recently
223 * opened block does. In all of these cases, the stack of open blocks should mirror
224 * the behavior in {@see \parse_blocks()}.
225 *
226 * Unlike {@see \parse_blocks()}, however, this processor can still operate on the
227 * invalid block delimiters. It provides a few functions which can be used for building
228 * custom and non-spec-compliant error handling.
229 *
230 * - {@see self::has_closing_flag()} indicates if the block delimiter contains the
231 * closing flag at the end. Some invalid block delimiters might contain both the
232 * void and closing flag, in which case {@see self::get_delimiter_type()} will
233 * report that it’s a void block.
234 * - {@see static::get_last_error()} indicates if the processor reached an invalid
235 * block closing. Depending on the context, {@see \parse_blocks()} might instead
236 * ignore the token or treat it as freeform HTML content.
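 *
 * As a sketch of such custom handling, with `$post_content` assumed to hold the
 * document and the `VOID` constant assumed to be publicly accessible, a caller
 * might collect delimiters carrying both flags:
 *
 *     $suspicious = array();
 *     $processor  = new WP_Block_Processor( $post_content );
 *     while ( $processor->next_delimiter() ) {
 *         if (
 *             WP_Block_Processor::VOID === $processor->get_delimiter_type() &&
 *             $processor->has_closing_flag()
 *         ) {
 *             // A delimiter such as `<!-- /wp:spacer /-->` has both flags.
 *             $suspicious[] = $processor->get_span();
 *         }
 *     }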
237 *
238 * ## Static helpers
239 *
240 * This class provides helpers for performing semantic block-related operations.
241 *
242 * - {@see self::normalize_block_type()} takes a block type with or without the
243 * implicit `core` namespace and returns a fully-qualified block type.
244 * - {@see self::are_equal_block_types()} indicates if two spans across one or
245 * more input texts represent the same fully-qualified block type.
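 *
 * For instance, going by the description above:
 *
 *     $type = WP_Block_Processor::normalize_block_type( 'gallery' );
 *     // $type is expected to be 'core/gallery'.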
246 *
247 * ## Subclassing
248 *
249 * This processor is designed to accurately parse a block document. Therefore, many
250 * of its methods are not meant for subclassing. However, overall this class supports
251 * building higher-level convenience classes which may choose to subclass it. For those
252 * classes, avoid re-implementing methods except for the list below. Instead, create
253 * new names representing the higher-level concepts being introduced. For example, instead
254 * of creating a new method named `next_block()` which only advances to blocks of a given
255 * kind, consider creating a new method named something like `next_layout_block()` which
256 * won’t interfere with the base class method.
257 *
258 * - {@see static::get_last_error()} may be reimplemented to report new errors in the subclass
259 * which aren’t intrinsic to block parsing.
260 * - {@see static::get_attributes()} may be reimplemented to provide a streaming interface
261 * to reading and modifying a block’s JSON attributes. It should be fast and memory efficient.
262 * - {@see static::get_last_json_error()} may be reimplemented to report new errors introduced
263 * with a reimplementation of {@see static::get_attributes()}.
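 *
 * A hypothetical subclass following this advice might look like the sketch
 * below. The class name, the method name, and the choice of block types are
 * illustrative assumptions, not part of this class:
 *
 *     class My_Layout_Block_Processor extends WP_Block_Processor {
 *         public function next_layout_block(): bool {
 *             while ( $this->next_block() ) {
 *                 if ( $this->is_block_type( 'group' ) || $this->is_block_type( 'columns' ) ) {
 *                     return true;
 *                 }
 *             }
 *
 *             return false;
 *         }
 *     }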
264 *
265 * @since 6.9.0
266 */
267class WP_Block_Processor {
268 /**
269 * Indicates if the last operation failed, otherwise
270 * will be `null` for success.
271 *
272 * @since 6.9.0
273 *
274 * @var string|null
275 */
276 private $last_error = null;
277
278 /**
279 * Indicates failures from decoding JSON attributes.
280 *
281 * @since 6.9.0
282 *
283 * @see \json_last_error()
284 *
285 * @var int
286 */
287 private $last_json_error = JSON_ERROR_NONE;
288
289 /**
290 * Source text provided to processor.
291 *
292 * @since 6.9.0
293 *
294 * @var string
295 */
296 protected $source_text;
297
298 /**
299 * Byte offset into source text where a matched delimiter starts.
300 *
301 * Example:
302 *
303 * 5 10 15 20 25 30 35 40 45 50
304 * <!-- wp:group --><!-- wp:void /--><!-- /wp:group -->
305 * ╰─ Starts at byte offset 17.
306 *
307 * @since 6.9.0
308 *
309 * @var int
310 */
311 private $matched_delimiter_at = 0;
312
313 /**
314 * Byte length of full span of a matched delimiter.
315 *
316 * Example:
317 *
318 * 5 10 15 20 25 30 35 40 45 50
319 * <!-- wp:group --><!-- wp:void /--><!-- /wp:group -->
320 * ╰───────────────╯
321 * 17 bytes long.
322 *
323 * @since 6.9.0
324 *
325 * @var int
326 */
327 private $matched_delimiter_length = 0;
328
329 /**
330 * First byte offset into source text following any previously-matched delimiter.
331 * Used to indicate where an HTML span starts.
332 *
333 * Example:
334 *
335 * 5 10 15 20 25 30 35 40 45 50 55
336 * <!-- wp:paragraph --><p>Content</p><⃨!⃨-⃨-⃨ ⃨/⃨w⃨p⃨:⃨p⃨a⃨r⃨a⃨g⃨r⃨a⃨p⃨h⃨ ⃨-⃨-⃨>⃨
337 * │ ╰─ This delimiter was matched, and after matching,
338 * │ revealed the preceding HTML span.
339 * │
340 * ╰─ The first byte offset after the previous matched delimiter
 * is 21. Because the matched delimiter starts at 35, which is after
342 * this, a span of HTML must exist between these boundaries.
343 *
344 * @since 6.9.0
345 *
346 * @var int
347 */
348 private $after_previous_delimiter = 0;
349
350 /**
351 * Byte offset where namespace span begins.
352 *
353 * When no namespace is present, this will be the same as the starting
354 * byte offset for the block name.
355 *
356 * Example:
357 *
358 * <!-- wp:core/gallery -->
359 * │ ╰─ Name starts here.
360 * ╰─ Namespace starts here.
361 *
362 * <!-- wp:gallery -->
363 * ├─ The namespace would start here but is implied as “core.”
364 * ╰─ The name starts here.
365 *
366 * @since 6.9.0
367 *
368 * @var int
369 */
370 private $namespace_at = 0;
371
372 /**
373 * Byte offset where block name span begins.
374 *
375 * When no namespace is present, this will be the same as the starting
376 * byte offset for the block namespace.
377 *
378 * Example:
379 *
380 * <!-- wp:core/gallery -->
381 * │ ╰─ Name starts here.
382 * ╰─ Namespace starts here.
383 *
384 * <!-- wp:gallery -->
385 * ├─ The namespace would start here but is implied as “core.”
386 * ╰─ The name starts here.
387 *
388 * @since 6.9.0
389 *
390 * @var int
391 */
392 private $name_at = 0;
393
394 /**
395 * Byte length of block name span.
396 *
397 * Example:
398 *
399 * 5 10 15 20 25
400 * <!-- wp:core/gallery -->
401 * ╰─────╯
402 * 7 bytes long.
403 *
404 * @since 6.9.0
405 *
406 * @var int
407 */
408 private $name_length = 0;
409
410 /**
411 * Whether the delimiter contains the block-closing flag.
412 *
413 * This may be erroneous if present within a void block,
 * therefore {@see self::has_closing_flag()} can be used by
415 * calling code to perform custom error-handling.
416 *
417 * @since 6.9.0
418 *
419 * @var bool
420 */
421 private $has_closing_flag = false;
422
423 /**
424 * Byte offset where JSON attributes span begins.
425 *
426 * Example:
427 *
428 * 5 10 15 20 25 30 35 40
429 * <!-- wp:paragraph {"dropCaps":true} -->
430 * ╰─ Starts at byte offset 18.
431 *
432 * @since 6.9.0
433 *
434 * @var int
435 */
436 private $json_at;
437
438 /**
439 * Byte length of JSON attributes span, or 0 if none are present.
440 *
441 * Example:
442 *
443 * 5 10 15 20 25 30 35 40
444 * <!-- wp:paragraph {"dropCaps":true} -->
445 * ╰───────────────╯
446 * 17 bytes long.
447 *
448 * @since 6.9.0
449 *
450 * @var int
451 */
452 private $json_length = 0;
453
454 /**
455 * Internal parser state, differentiating whether the instance is currently matched,
456 * on an implicit freeform node, in error, or ready to begin parsing.
457 *
458 * @see self::READY
459 * @see self::MATCHED
460 * @see self::HTML_SPAN
461 * @see self::INCOMPLETE_INPUT
462 * @see self::COMPLETE
463 *
464 * @since 6.9.0
465 *
466 * @var string
467 */
468 protected $state = self::READY;
469
470 /**
471 * Indicates what kind of block comment delimiter was matched.
472 *
473 * One of:
474 *
475 * - {@see self::OPENER} If the delimiter is opening a block.
476 * - {@see self::CLOSER} If the delimiter is closing an open block.
477 * - {@see self::VOID} If the delimiter represents a void block with no inner content.
478 *
 * If a parsed comment delimiter contains both the closing and the void
 * flags then it will be interpreted as a void block to match the behavior
 * of the official block parser. This is a syntax error, however, and the
 * delimiter probably ought to close an open block of the same name, if one is open.
483 *
484 * @since 6.9.0
485 *
486 * @var string
487 */
488 private $type;
489
490 /**
491 * Whether the last-matched delimiter acts like a void block and should be
492 * popped from the stack of open blocks as soon as the parser advances.
493 *
494 * This applies to void block delimiters and to HTML spans.
495 *
496 * @since 6.9.0
497 *
498 * @var bool
499 */
500 private $was_void = false;
501
502 /**
503 * For every open block, in hierarchical order, this stores the byte offset
504 * into the source text where the block type starts, including for HTML spans.
505 *
506 * To avoid allocating and normalizing block names when they aren’t requested,
507 * the stack of open blocks is stored as the byte offsets and byte lengths of
508 * each open block’s block type. This allows for minimal tracking and quick
509 * reading or comparison of block types when requested.
510 *
511 * @since 6.9.0
512 *
513 * @see self::$open_blocks_length
514 *
515 * @var int[]
516 */
517 private $open_blocks_at = array();
518
519 /**
520 * For every open block, in hierarchical order, this stores the byte length
521 * of the block’s block type in the source text. For HTML spans this is 0.
522 *
523 * @since 6.9.0
524 *
525 * @see self::$open_blocks_at
526 *
527 * @var int[]
528 */
529 private $open_blocks_length = array();
530
531 /**
532 * Indicates which operation should apply to the stack of open blocks after
533 * processing any pending spans of HTML.
534 *
535 * Since HTML spans are discovered after matching block delimiters, those
536 * delimiters need to defer modifying the stack of open blocks. This value,
537 * if set, indicates what operation should be applied. The properties
538 * associated with token boundaries still point to the delimiters even
539 * when processing HTML spans, so there’s no need to track them independently.
540 *
541 * @var 'push'|'void'|'pop'|null
542 */
543 private $next_stack_op = null;
544
545 /**
546 * Creates a new block processor.
547 *
548 * Example:
549 *
550 * $processor = new WP_Block_Processor( $post_content );
551 * if ( $processor->next_block( 'core/image' ) ) {
552 * echo "Found an image!\n";
553 * }
554 *
555 * @see self::next_block() to advance to the start of the next block (skips closers).
556 * @see self::next_delimiter() to advance to the next explicit block delimiter.
557 * @see self::next_token() to advance to the next block delimiter or HTML span.
558 *
559 * @since 6.9.0
560 *
561 * @param string $source_text Input document potentially containing block content.
562 */
563 public function __construct( string $source_text ) {
564 $this->source_text = $source_text;
565 }
566
567 /**
568 * Advance to the next block delimiter which opens a block, indicating if one was found.
569 *
570 * Delimiters which open blocks include opening and void block delimiters. To visit
571 * freeform HTML content, pass the wildcard “*” as the block type.
572 *
573 * Use this function to walk through the blocks in a document, pausing where they open.
574 *
575 * Example blocks:
576 *
577 * // The first delimiter opens the paragraph block.
578 * <⃨!⃨-⃨-⃨ ⃨w⃨p⃨:⃨p⃨a⃨r⃨a⃨g⃨r⃨a⃨p⃨h⃨ ⃨-⃨-⃨>⃨<p>Content</p><!-- /wp:paragraph-->
579 *
580 * // The void block is the first opener in this sequence of closers.
581 * <!-- /wp:group --><⃨!⃨-⃨-⃨ ⃨w⃨p⃨:⃨s⃨p⃨a⃨c⃨e⃨r⃨ ⃨{⃨"⃨h⃨e⃨i⃨g⃨h⃨t⃨"⃨:⃨"⃨2⃨0⃨0⃨p⃨x⃨"⃨}⃨ ⃨/⃨-⃨-⃨>⃨<!-- /wp:group -->
582 *
583 * // If, however, `*` is provided as the block type, freeform content is matched.
584 * <⃨h⃨2⃨>⃨M⃨y⃨ ⃨s⃨y⃨n⃨o⃨p⃨s⃨i⃨s⃨<⃨/⃨h⃨2⃨>⃨\⃨n⃨<!-- wp:my/table-of-contents /-->
585 *
586 * // Inner HTML is never freeform content, and will not be matched even with the wildcard.
 * <!-- /wp:list-item --></ul><!-- /wp:list --><⃨!⃨-⃨-⃨ ⃨w⃨p⃨:⃨p⃨a⃨r⃨a⃨g⃨r⃨a⃨p⃨h⃨ ⃨-⃨-⃨>⃨<p>
588 *
589 * Example:
590 *
591 * // Find all textual ranges of image block opening delimiters.
592 * $images = array();
593 * $processor = new WP_Block_Processor( $html );
594 * while ( $processor->next_block( 'core/image' ) ) {
595 * $images[] = $processor->get_span();
596 * }
597 *
598 * In some cases it may be useful to conditionally visit the implicit freeform
599 * blocks, such as when determining if a post contains freeform content that
600 * isn’t purely whitespace.
601 *
602 * Example:
603 *
 *     $seen_block_types = array();
 *     $block_type       = '*';
 *     $processor        = new WP_Block_Processor( $html );
 *     while ( $processor->next_block( $block_type ) ) {
 *         if ( null === $processor->get_block_type() ) {
 *             // Top-level freeform HTML only counts when it’s more than whitespace.
 *             if ( $processor->is_non_whitespace_html() ) {
 *                 // Stop wasting time visiting freeform blocks after one has been found.
 *                 $block_type                        = null;
 *                 $seen_block_types['core/freeform'] = true;
 *             }
 *             continue;
 *         }
 *
 *         $seen_block_types[ $processor->get_block_type() ] = true;
 *     }
621 *
622 * @since 6.9.0
623 *
624 * @see self::next_delimiter() to advance to the next explicit block delimiter.
625 * @see self::next_token() to advance to the next block delimiter or HTML span.
626 *
627 * @param string|null $block_type Optional. If provided, advance until a block of this type is found.
628 * Default is to stop at any block regardless of its type.
629 * @return bool Whether an opening delimiter for a block was found.
630 */
631 public function next_block( ?string $block_type = null ): bool {
632 while ( $this->next_delimiter( $block_type ) ) {
633 if ( self::CLOSER !== $this->get_delimiter_type() ) {
634 return true;
635 }
636 }
637
638 return false;
639 }
640
641 /**
642 * Advance to the next block delimiter in a document, indicating if one was found.
643 *
644 * Delimiters may include invalid JSON. This parser does not attempt to parse the
645 * JSON attributes until requested; when invalid, the attributes will be null. This
646 * matches the behavior of {@see \parse_blocks()}. To visit freeform HTML content,
647 * pass the wildcard “*” as the block type.
648 *
649 * Use this function to walk through the block delimiters in a document.
650 *
651 * Example delimiters:
652 *
653 * <!-- wp:paragraph {"dropCap": true} -->
654 * <!-- wp:separator /-->
655 * <!-- /wp:paragraph -->
656 *
657 * // If the wildcard `*` is provided as the block type, freeform content is matched.
658 * <⃨h⃨2⃨>⃨M⃨y⃨ ⃨s⃨y⃨n⃨o⃨p⃨s⃨i⃨s⃨<⃨/⃨h⃨2⃨>⃨\⃨n⃨<!-- wp:my/table-of-contents /-->
659 *
660 * // Inner HTML is never freeform content, and will not be matched even with the wildcard.
661 * ...</ul><⃨!⃨-⃨-⃨ ⃨/⃨w⃨p⃨:⃨l⃨i⃨s⃨t⃨ ⃨-⃨-⃨>⃨<!-- wp:paragraph --><p>
662 *
663 * Example:
664 *
 *     $html      = "<!-- wp:void /-->\n<!-- wp:void /-->";
 *     $processor = new WP_Block_Processor( $html );
 *     while ( $processor->next_delimiter() ) {
 *         // Runs twice, seeing both void blocks of type “core/void.”
 *     }
 *
 *     $processor = new WP_Block_Processor( $html );
 *     while ( $processor->next_delimiter( '*' ) ) {
 *         // Runs thrice, seeing the void block, the newline span, and the void block.
 *     }
675 *
676 * @since 6.9.0
677 *
678 * @param string|null $block_name Optional. Keep searching until a block of this name is found.
679 * Defaults to visit every block regardless of type.
680 * @return bool Whether a block delimiter was matched.
681 */
682 public function next_delimiter( ?string $block_name = null ): bool {
683 if ( ! isset( $block_name ) ) {
684 while ( $this->next_token() ) {
685 if ( ! $this->is_html() ) {
686 return true;
687 }
688 }
689
690 return false;
691 }
692
693 while ( $this->next_token() ) {
694 if ( $this->is_block_type( $block_name ) ) {
695 return true;
696 }
697 }
698
699 return false;
700 }
701
702 /**
703 * Advance to the next block delimiter or HTML span in a document, indicating if one was found.
704 *
705 * This function steps through every syntactic chunk in a document. This includes explicit
706 * block comment delimiters, freeform non-block content, and inner HTML segments.
707 *
708 * Example tokens:
709 *
710 * <!-- wp:paragraph {"dropCap": true} -->
711 * <!-- wp:separator /-->
712 * <!-- /wp:paragraph -->
713 * <p>Normal HTML content</p>
714 * Plaintext content too!
715 *
716 * Example:
717 *
718 * // Find span containing wrapping HTML element surrounding inner blocks.
719 * $processor = new WP_Block_Processor( $html );
720 * if ( ! $processor->next_block( 'gallery' ) ) {
721 * return null;
722 * }
723 *
724 * $containing_span = null;
725 * while ( $processor->next_token() && $processor->is_html() ) {
726 * $containing_span = $processor->get_span();
727 * }
728 *
729 * This method will visit all HTML spans including those forming freeform non-block
730 * content as well as those which are part of a block’s inner HTML.
731 *
732 * @since 6.9.0
733 *
 * @return bool Whether a token was matched. Returns `false` when no further token is found, including at the end of the document or on incomplete input.
735 */
736 public function next_token(): bool {
737 if ( $this->last_error || self::COMPLETE === $this->state || self::INCOMPLETE_INPUT === $this->state ) {
738 return false;
739 }
740
741 // Void tokens automatically pop off the stack of open blocks.
742 if ( $this->was_void ) {
743 array_pop( $this->open_blocks_at );
744 array_pop( $this->open_blocks_length );
745 $this->was_void = false;
746 }
747
748 $text = $this->source_text;
749 $end = strlen( $text );
750
751 /*
752 * Because HTML spans are inferred after finding the next delimiter, it means that
753 * the parser must transition out of that HTML state and reuse the token boundaries
754 * it found after the HTML span. If those boundaries are before the end of the
755 * document it implies that a real delimiter was found; otherwise this must be the
756 * terminating HTML span and the parsing is complete.
757 */
758 if ( self::HTML_SPAN === $this->state ) {
759 if ( $this->matched_delimiter_at >= $end ) {
760 $this->state = self::COMPLETE;
761 return false;
762 }
763
764 switch ( $this->next_stack_op ) {
765 case 'void':
766 $this->was_void = true;
767 $this->open_blocks_at[] = $this->namespace_at;
768 $this->open_blocks_length[] = $this->name_at + $this->name_length - $this->namespace_at;
769 break;
770
771 case 'push':
772 $this->open_blocks_at[] = $this->namespace_at;
773 $this->open_blocks_length[] = $this->name_at + $this->name_length - $this->namespace_at;
774 break;
775
776 case 'pop':
777 array_pop( $this->open_blocks_at );
778 array_pop( $this->open_blocks_length );
779 break;
780 }
781
782 $this->next_stack_op = null;
783 $this->state = self::MATCHED;
784 return true;
785 }
786
787 $this->state = self::READY;
788 $after_prev_delimiter = $this->matched_delimiter_at + $this->matched_delimiter_length;
789 $at = $after_prev_delimiter;
790
791 while ( $at < $end ) {
792 /*
793 * Find the next possible start of a delimiter.
794 *
795 * This follows the behavior in the official block parser, which segments a post
796 * by the block comment delimiters. It is possible for an HTML attribute to contain
797 * what looks like a block comment delimiter but which is actually an HTML attribute
798 * value. In such a case, the parser here will break apart the HTML and create the
799 * block boundary inside the HTML attribute. In other words, the block parser
800 * isolates sections of HTML from each other, even if that leads to malformed markup.
801 *
802 * For a more robust parse, scan through the document with the HTML API and parse
803 * comments once they are matched to see if they are also block delimiters. In
804 * practice, this nuance has not caused any known problems since developing blocks.
805 *
806 * <⃨!⃨-⃨-⃨ /wp:core/paragraph {"dropCap":true} /-->
807 */
808 $comment_opening_at = strpos( $text, '<!--', $at );
809
810 /*
811 * Even if the start of a potential block delimiter is not found, the document
812 * might end in a prefix of such, and in that case there is incomplete input.
813 */
814 if ( false === $comment_opening_at ) {
815 if ( str_ends_with( $text, '<!-' ) ) {
816 $backup = 3;
817 } elseif ( str_ends_with( $text, '<!' ) ) {
818 $backup = 2;
819 } elseif ( str_ends_with( $text, '<' ) ) {
820 $backup = 1;
821 } else {
822 $backup = 0;
823 }
824
825 // Whether or not there is a potential delimiter, there might be an HTML span.
826 if ( $after_prev_delimiter < ( $end - $backup ) ) {
827 $this->state = self::HTML_SPAN;
828 $this->after_previous_delimiter = $after_prev_delimiter;
829 $this->matched_delimiter_at = $end - $backup;
830 $this->matched_delimiter_length = $backup;
831 $this->open_blocks_at[] = $after_prev_delimiter;
832 $this->open_blocks_length[] = 0;
833 $this->was_void = true;
834 return true;
835 }
836
837 /*
838 * In the case that there is the start of an HTML comment, it means that there
839 * might be a block delimiter, but it’s not possible to know, therefore it’s incomplete.
840 */
841 if ( $backup > 0 ) {
842 goto incomplete;
843 }
844
845 // Otherwise this is the end.
846 $this->state = self::COMPLETE;
847 return false;
848 }
849
850 // <!-- ⃨/wp:core/paragraph {"dropCap":true} /-->
851 $opening_whitespace_at = $comment_opening_at + 4;
852 if ( $opening_whitespace_at >= $end ) {
853 goto incomplete;
854 }
855
856 $opening_whitespace_length = strspn( $text, " \t\f\r\n", $opening_whitespace_at );
857
858 /*
859 * The `wp` prefix cannot come before this point, but it may come after it
860 * depending on the presence of the closer. This is detected next.
861 */
862 $wp_prefix_at = $opening_whitespace_at + $opening_whitespace_length;
863 if ( $wp_prefix_at >= $end ) {
864 goto incomplete;
865 }
866
867 if ( 0 === $opening_whitespace_length ) {
868 $at = $this->find_html_comment_end( $comment_opening_at, $end );
869 continue;
870 }
871
872 // <!-- /⃨wp:core/paragraph {"dropCap":true} /-->
873 $has_closer = false;
874 if ( '/' === $text[ $wp_prefix_at ] ) {
875 $has_closer = true;
876 ++$wp_prefix_at;
877 }
878
879 // <!-- /w⃨p⃨:⃨core/paragraph {"dropCap":true} /-->
880 if ( $wp_prefix_at < $end && 0 !== substr_compare( $text, 'wp:', $wp_prefix_at, 3 ) ) {
881 if (
882 ( $wp_prefix_at + 2 >= $end && str_ends_with( $text, 'wp' ) ) ||
883 ( $wp_prefix_at + 1 >= $end && str_ends_with( $text, 'w' ) )
884 ) {
885 goto incomplete;
886 }
887
888 $at = $this->find_html_comment_end( $comment_opening_at, $end );
889 continue;
890 }
891
892 /*
893 * If the block contains no namespace, the block name will end up masquerading
894 * as the namespace. It’s easier to first detect the span and then determine
895 * if it’s a namespace or a name.
896 *
897 * <!-- /wp:c⃨o⃨r⃨e⃨/paragraph {"dropCap":true} /-->
898 */
899 $namespace_at = $wp_prefix_at + 3;
900 if ( $namespace_at >= $end ) {
901 goto incomplete;
902 }
903
904 $start_of_namespace = $text[ $namespace_at ];
905
906 // The namespace must start with a-z.
907 if ( 'a' > $start_of_namespace || 'z' < $start_of_namespace ) {
908 $at = $this->find_html_comment_end( $comment_opening_at, $end );
909 continue;
910 }
911
912 $namespace_length = 1 + strspn( $text, 'abcdefghijklmnopqrstuvwxyz0123456789-_', $namespace_at + 1 );
913 $separator_at = $namespace_at + $namespace_length;
914 if ( $separator_at >= $end ) {
915 goto incomplete;
916 }
917
918 // <!-- /wp:core/⃨paragraph {"dropCap":true} /-->
919 $has_separator = '/' === $text[ $separator_at ];
920 if ( $has_separator ) {
921 $name_at = $separator_at + 1;
922
923 if ( $name_at >= $end ) {
924 goto incomplete;
925 }
926
927 // <!-- /wp:core/p⃨a⃨r⃨a⃨g⃨r⃨a⃨p⃨h⃨ {"dropCap":true} /-->
928 $start_of_name = $text[ $name_at ];
929 if ( 'a' > $start_of_name || 'z' < $start_of_name ) {
930 $at = $this->find_html_comment_end( $comment_opening_at, $end );
931 continue;
932 }
933
934 $name_length = 1 + strspn( $text, 'abcdefghijklmnopqrstuvwxyz0123456789-_', $name_at + 1 );
935 } else {
936 $name_at = $namespace_at;
937 $name_length = $namespace_length;
938 }
939
940 if ( $name_at + $name_length >= $end ) {
941 goto incomplete;
942 }
943
944 /*
945 * For this next section of the delimiter, it could be the JSON attributes
946 * or it could be the end of the comment. Assume that the JSON is there and
947 * update if it’s not.
948 */
949
950 // <!-- /wp:core/paragraph ⃨{"dropCap":true} /-->
951 $after_name_whitespace_at = $name_at + $name_length;
952 $after_name_whitespace_length = strspn( $text, " \t\f\r\n", $after_name_whitespace_at );
953 $json_at = $after_name_whitespace_at + $after_name_whitespace_length;
954
955 if ( $json_at >= $end ) {
956 goto incomplete;
957 }
958
959 if ( 0 === $after_name_whitespace_length ) {
960 $at = $this->find_html_comment_end( $comment_opening_at, $end );
961 continue;
962 }
963
964 // <!-- /wp:core/paragraph {⃨"dropCap":true} /-->
965 $has_json = '{' === $text[ $json_at ];
966 $json_length = 0;
967
968 /*
969 * For the final span of the delimiter it's most efficient to find the end of the
970 * HTML comment and work backwards. This prevents complicated parsing inside the
971 * JSON span, which is not allowed to contain the HTML comment terminator.
972 *
973 * This also matches the behavior in the official block parser,
974 * even though it allows for matching invalid JSON content.
975 *
976 * <!-- /wp:core/paragraph {"dropCap":true} /-⃨-⃨>⃨
977 */
978 $comment_closing_at = strpos( $text, '-->', $json_at );
979 if ( false === $comment_closing_at ) {
980 goto incomplete;
981 }
982
983 // <!-- /wp:core/paragraph {"dropCap":true} /⃨-->
984 if ( '/' === $text[ $comment_closing_at - 1 ] ) {
985 $has_void_flag = true;
986 $void_flag_length = 1;
987 } else {
988 $has_void_flag = false;
989 $void_flag_length = 0;
990 }
991
992 /*
993 * If there's no JSON, then the span of text after the name
994 * until the comment closing must be completely whitespace.
995 * Otherwise it’s a normal HTML comment.
996 */
997 if ( ! $has_json ) {
998 if ( $after_name_whitespace_at + $after_name_whitespace_length === $comment_closing_at - $void_flag_length ) {
999 // This must be a block delimiter!
1000 $this->state = self::MATCHED;
1001 break;
1002 }
1003
1004 $at = $this->find_html_comment_end( $comment_opening_at, $end );
1005 continue;
1006 }
1007
1008 /*
1009 * There's JSON, so attempt to find its boundary.
1010 *
1011 * @todo It’s likely faster to scan forward instead of in reverse.
1012 *
1013 * <!-- /wp:core/paragraph {"dropCap":true}⃨ ⃨/-->
1014 */
1015 $after_json_whitespace_length = 0;
1016 for ( $char_at = $comment_closing_at - $void_flag_length - 1; $char_at > $json_at; $char_at-- ) {
1017 $char = $text[ $char_at ];
1018
1019 switch ( $char ) {
1020 case ' ':
1021 case "\t":
1022 case "\f":
1023 case "\r":
1024 case "\n":
1025 ++$after_json_whitespace_length;
1026 continue 2;
1027
1028 case '}':
1029 $json_length = $char_at - $json_at + 1;
1030 break 2;
1031
1032 default:
1033 ++$at;
1034 continue 3;
1035 }
1036 }
1037
1038 /*
1039 * This covers cases where there is no terminating “}” or where
1040 * mandatory whitespace is missing.
1041 */
1042 if ( 0 === $json_length || 0 === $after_json_whitespace_length ) {
1043 $at = $this->find_html_comment_end( $comment_opening_at, $end );
1044 continue;
1045 }
1046
1047 // This must be a block delimiter!
1048 $this->state = self::MATCHED;
1049 break;
1050 }
1051
1052 // The end of the document was reached without a match.
1053 if ( self::MATCHED !== $this->state ) {
1054 $this->state = self::COMPLETE;
1055 return false;
1056 }
1057
1058 /*
1059 * From this point forward, a delimiter has been matched. There
1060 * might also be an HTML span that appears before the delimiter.
1061 */
1062
1063 $this->after_previous_delimiter = $after_prev_delimiter;
1064
1065 $this->matched_delimiter_at = $comment_opening_at;
1066 $this->matched_delimiter_length = $comment_closing_at + 3 - $comment_opening_at;
1067
1068 $this->namespace_at = $namespace_at;
1069 $this->name_at = $name_at;
1070 $this->name_length = $name_length;
1071
1072 $this->json_at = $json_at;
1073 $this->json_length = $json_length;
1074
1075 /*
1076 * When delimiters contain both the void flag and the closing flag
1077 * they shall be interpreted as void blocks, per the spec parser.
1078 */
1079 if ( $has_void_flag ) {
1080 $this->type = self::VOID;
1081 $this->next_stack_op = 'void';
1082 } elseif ( $has_closer ) {
1083 $this->type = self::CLOSER;
1084 $this->next_stack_op = 'pop';
1085
1086 /*
1087 * @todo Check if the name matches and bail according to the spec parser.
1088 * The default parser doesn’t examine the names.
1089 */
1090 } else {
1091 $this->type = self::OPENER;
1092 $this->next_stack_op = 'push';
1093 }
1094
1095 $this->has_closing_flag = $has_closer;
1096
1097 // HTML spans are visited before the delimiter that follows them.
1098 if ( $comment_opening_at > $after_prev_delimiter ) {
1099 $this->state = self::HTML_SPAN;
1100 $this->open_blocks_at[] = $after_prev_delimiter;
1101 $this->open_blocks_length[] = 0;
1102 $this->was_void = true;
1103
1104 return true;
1105 }
1106
1107 // If there were no HTML spans then flush the enqueued stack operations immediately.
1108 switch ( $this->next_stack_op ) {
1109 case 'void':
1110 $this->was_void = true;
1111 $this->open_blocks_at[] = $namespace_at;
1112 $this->open_blocks_length[] = $name_at + $name_length - $namespace_at;
1113 break;
1114
1115 case 'push':
1116 $this->open_blocks_at[] = $namespace_at;
1117 $this->open_blocks_length[] = $name_at + $name_length - $namespace_at;
1118 break;
1119
1120 case 'pop':
1121 array_pop( $this->open_blocks_at );
1122 array_pop( $this->open_blocks_length );
1123 break;
1124 }
1125
1126 $this->next_stack_op = null;
1127
1128 return true;
1129
1130 incomplete:
1131 $this->state = self::COMPLETE;
1132 $this->last_error = self::INCOMPLETE_INPUT;
1133 return false;
1134 }
1135
1136 /**
1137 * Returns an array containing the names of the currently-open blocks, in order
1138 * from outermost to innermost, with HTML spans indicated as “#html”.
1139 *
1140 * Example:
1141 *
1142 * // Freeform HTML content is an HTML span.
1143 * $processor = new WP_Block_Processor( 'Just text' );
1144 * $processor->next_token();
1145 * array( '#html' ) === $processor->get_breadcrumbs();
1146 *
1147 * $processor = new WP_Block_Processor( '<!-- wp:a --><!-- wp:b --><!-- wp:c /--><!-- /wp:b --><!-- /wp:a -->' );
1148 * $processor->next_token();
1149 * array( 'core/a' ) === $processor->get_breadcrumbs();
1150 * $processor->next_token();
1151 * array( 'core/a', 'core/b' ) === $processor->get_breadcrumbs();
1152 * $processor->next_token();
1153 * // Void blocks are only open while visiting them.
1154 * array( 'core/a', 'core/b', 'core/c' ) === $processor->get_breadcrumbs();
1155 * $processor->next_token();
1156 * // Blocks are closed before visiting their closing delimiter.
1157 * array( 'core/a' ) === $processor->get_breadcrumbs();
1158 * $processor->next_token();
1159 * array() === $processor->get_breadcrumbs();
1160 *
1161 * // Inner HTML is also an HTML span.
1162 * $processor = new WP_Block_Processor( '<!-- wp:a -->Inner HTML<!-- /wp:a -->' );
1163 * $processor->next_token();
1164 * $processor->next_token();
1165 * array( 'core/a', '#html' ) === $processor->get_breadcrumbs();
1166 *
1167 * @since 6.9.0
1168 *
1169 * @return string[]
1170 */
1171 public function get_breadcrumbs(): array {
1172 $breadcrumbs = array_fill( 0, count( $this->open_blocks_at ), null );
1173
1174 /*
1175 * Since HTML spans can only be at the very end, set the normalized block name for
1176 * each open element and then work backwards after creating the array. This allows
1177 * for the elimination of a conditional on each iteration of the loop.
1178 */
1179 foreach ( $this->open_blocks_at as $i => $at ) {
1180 $block_type = substr( $this->source_text, $at, $this->open_blocks_length[ $i ] );
1181 $breadcrumbs[ $i ] = self::normalize_block_type( $block_type );
1182 }
1183
1184 if ( isset( $i ) && 0 === $this->open_blocks_length[ $i ] ) {
1185 $breadcrumbs[ $i ] = '#html';
1186 }
1187
1188 return $breadcrumbs;
1189 }
1190
1191 /**
1192 * Returns the depth of the open blocks where the processor is currently matched.
1193 *
1194 * Depth increases before visiting openers and void blocks and decreases before
1195 * visiting closers. HTML spans behave like void blocks.
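 *
 * Example (a sketch that follows the same token sequence documented
 * for {@see self::get_breadcrumbs()}):
 *
 *     $processor = new WP_Block_Processor( '<!-- wp:a --><!-- wp:b --><!-- wp:c /--><!-- /wp:b --><!-- /wp:a -->' );
 *     $processor->next_token();
 *     1 === $processor->get_depth();
 *     $processor->next_token();
 *     2 === $processor->get_depth();
 *     $processor->next_token();
 *     // Void blocks count toward the depth only while visiting them.
 *     3 === $processor->get_depth();
 *     $processor->next_token();
 *     // The block “b” is closed before visiting its closing delimiter.
 *     1 === $processor->get_depth();
 *     $processor->next_token();
 *     0 === $processor->get_depth();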
1196 *
1197 * @since 6.9.0
1198 *
1199 * @return int
1200 */
1201 public function get_depth(): int {
1202 return count( $this->open_blocks_at );
1203 }
1204
1205 /**
1206 * Extracts a block object, and all inner content, starting at a matched opening
1207 * block delimiter, or at a matched top-level HTML span as freeform HTML content.
1208 *
1209 * Use this function to extract some blocks within a document, but not all. For example,
1210 * one might want to find image galleries, parse them, modify them, and then reserialize
1211 * them in place.
1212 *
1213 * Once this function returns, the parser will be matched on the token following
1214 * the close of the given block.
1215 *
1216 * The return type of this method is compatible with the return of {@see \parse_blocks()}.
1217 *
1218 * Example:
1219 *
1220 * $processor = new WP_Block_Processor( $post_content );
1221 * if ( ! $processor->next_block( 'gallery' ) ) {
1222 * return $post_content;
1223 * }
1224 *
1225 * $gallery_at = $processor->get_span()->start;
1226 * $gallery = $processor->extract_full_block_and_advance();
1227 * $ends_before = $processor->get_span();
1228 * $ends_before = $ends_before->start ?? strlen( $post_content );
1229 *
1230 * $new_gallery = update_gallery( $gallery );
1231 * $new_gallery = serialize_block( $new_gallery );
1232 *
1233 * return (
1234 * substr( $post_content, 0, $gallery_at ) .
1235 * $new_gallery .
1236 * substr( $post_content, $ends_before )
1237 * );
1238 *
1239 * @since 6.9.0
1240 *
1241 * @return array[]|null {
1242 * Array of block structures.
1243 *
1244 * @type array ...$0 {
1245 * An associative array of a single parsed block object. See WP_Block_Parser_Block.
1246 *
1247 * @type string|null $blockName Name of block.
1248 * @type array $attrs Attributes from block comment delimiters.
1249 * @type array[] $innerBlocks List of inner blocks. An array of arrays that
1250 * have the same structure as this one.
1251 * @type string $innerHTML HTML from inside block comment delimiters.
1252 * @type array $innerContent List of string fragments and null markers where
1253 * inner blocks were found.
1254 * }
1255 * }
1256 */
1257 public function extract_full_block_and_advance(): ?array {
1258 if ( $this->is_html() ) {
1259 $chunk = $this->get_html_content();
1260
1261 return array(
1262 'blockName' => null,
1263 'attrs' => array(),
1264 'innerBlocks' => array(),
1265 'innerHTML' => $chunk,
1266 'innerContent' => array( $chunk ),
1267 );
1268 }
1269
1270 $block = array(
1271 'blockName' => $this->get_block_type(),
1272 'attrs' => $this->allocate_and_return_parsed_attributes() ?? array(),
1273 'innerBlocks' => array(),
1274 'innerHTML' => '',
1275 'innerContent' => array(),
1276 );
1277
1278 $depth = $this->get_depth();
1279 while ( $this->next_token() && $this->get_depth() > $depth ) {
1280 if ( $this->is_html() ) {
1281 $chunk = $this->get_html_content();
1282 $block['innerHTML'] .= $chunk;
1283 $block['innerContent'][] = $chunk;
1284 continue;
1285 }
1286
1287 /**
1288 * Inner blocks.
1289 *
1290 * @todo This is a decent place to call {@link \render_block()}
1291 * @todo Use iteration instead of recursion, or at least refactor to tail-call form.
1292 */
1293 if ( $this->opens_block() ) {
1294 $inner_block = $this->extract_full_block_and_advance();
1295 $block['innerBlocks'][] = $inner_block;
1296 $block['innerContent'][] = null;
1297 }
1298
1299 /*
1300 * Because the parser has advanced past the closing block token, it
1301 * may be matched on an HTML span. This needs to be processed before
1302 * moving on to the next token at the start of the next loop iteration.
1303 */
1304 if ( $this->is_html() ) {
1305 $chunk = $this->get_html_content();
1306 $block['innerHTML'] .= $chunk;
1307 $block['innerContent'][] = $chunk;
1308 }
1309 }
1310
1311 return $block;
1312 }
1313
1314 /**
1315 * Returns the byte-offset after the ending character of an HTML comment,
1316 * assuming the proper starting byte offset.
1317 *
1318 * @since 6.9.0
1319 *
1320 * @param int $comment_starting_at Where the HTML comment started, the leading `<`.
1321 * @param int $search_end Last offset in which to search, for limiting search span.
1322 * @return int Offset after the current HTML comment ends, or `$search_end` if no end was found.
1323 */
1324 private function find_html_comment_end( int $comment_starting_at, int $search_end ): int {
1325 $text = $this->source_text;
1326
1327 // Find span-of-dashes comments which look like `<!----->`.
1328 $span_of_dashes = strspn( $text, '-', $comment_starting_at + 2 );
1329 if (
1330 $comment_starting_at + 2 + $span_of_dashes < $search_end &&
1331 '>' === $text[ $comment_starting_at + 2 + $span_of_dashes ]
1332 ) {
1333 return $comment_starting_at + 2 + $span_of_dashes + 1;
1334 }
1335
1336 // Otherwise, there are other characters inside the comment, find the first `-->` or `--!>`.
1337 $now_at = $comment_starting_at + 4;
1338 while ( $now_at < $search_end ) {
1339 $dashes_at = strpos( $text, '--', $now_at );
1340 if ( false === $dashes_at ) {
1341 return $search_end;
1342 }
1343
1344 $closer_must_be_at = $dashes_at + 2 + strspn( $text, '-', $dashes_at + 2 );
1345 if ( $closer_must_be_at < $search_end && '!' === $text[ $closer_must_be_at ] ) {
1346 ++$closer_must_be_at;
1347 }
1348
1349 if ( $closer_must_be_at < $search_end && '>' === $text[ $closer_must_be_at ] ) {
1350 return $closer_must_be_at + 1;
1351 }
1352
1353 ++$now_at;
1354 }
1355
1356 return $search_end;
1357 }
1358
1359 /**
1360 * Returns the error from the last attempt to parse a block comment delimiter,
1361 * or `null` if the last attempt succeeded.
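 *
 * Example (a minimal sketch, assuming the document is truncated in
 * the middle of what may be a block delimiter):
 *
 *     $processor = new WP_Block_Processor( '<!-- wp:paragraph' );
 *     false === $processor->next_token();
 *     WP_Block_Processor::INCOMPLETE_INPUT === $processor->get_last_error();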
1362 *
1363 * @since 6.9.0
1364 *
1365 * @return string|null Error from last attempt at parsing next block delimiter,
1366 * or `null` if last attempt succeeded.
1367 */
1368 public function get_last_error(): ?string {
1369 return $this->last_error;
1370 }
1371
1372 /**
1373 * Returns the JSON error code from the last attempt to parse a block’s JSON attributes.
1374 *
1375 * @see \json_last_error()
1376 *
1377 * @since 6.9.0
1378 *
1379 * @return int JSON_ERROR_ code from last attempt to parse block JSON attributes.
1380 */
1381 public function get_last_json_error(): int {
1382 return $this->last_json_error;
1383 }
1384
1385 /**
1386 * Returns the type of the block comment delimiter.
1387 *
1388 * One of:
1389 *
1390 * - {@see self::OPENER}
1391 * - {@see self::CLOSER}
1392 * - {@see self::VOID}
1393 * - `null`
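 *
 * Example (a sketch; the block names are placeholders):
 *
 *     $processor = new WP_Block_Processor( '<!-- wp:a --><!-- wp:b /--><!-- /wp:a -->' );
 *     null === $processor->get_delimiter_type();
 *     $processor->next_token();
 *     WP_Block_Processor::OPENER === $processor->get_delimiter_type();
 *     $processor->next_token();
 *     WP_Block_Processor::VOID === $processor->get_delimiter_type();
 *     $processor->next_token();
 *     WP_Block_Processor::CLOSER === $processor->get_delimiter_type();
 *
 *     // HTML spans are reported as void.
 *     $processor = new WP_Block_Processor( 'Just text' );
 *     $processor->next_token();
 *     WP_Block_Processor::VOID === $processor->get_delimiter_type();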
1394 *
1395 * @since 6.9.0
1396 *
1397 * @return string|null Type of the block comment delimiter, if currently matched, otherwise `null`.
1398 */
1399 public function get_delimiter_type(): ?string {
1400 switch ( $this->state ) {
1401 case self::HTML_SPAN:
1402 return self::VOID;
1403
1404 case self::MATCHED:
1405 return $this->type;
1406
1407 default:
1408 return null;
1409 }
1410 }
1411
1412 /**
1413 * Returns whether the delimiter contains the closing flag.
1414 *
1415 * This should be avoided except in cases of custom error-handling
1416 * with block closers containing the void flag. For normative use,
1417 * {@see self::get_delimiter_type()}.
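 *
 * Example (a sketch of a malformed delimiter carrying both the closing
 * flag and the void flag, which is interpreted as a void block):
 *
 *     $processor = new WP_Block_Processor( '<!-- /wp:a /-->' );
 *     $processor->next_token();
 *     WP_Block_Processor::VOID === $processor->get_delimiter_type();
 *     true === $processor->has_closing_flag();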
1418 *
1419 * @since 6.9.0
1420 *
1421 * @return bool Whether the currently-matched block delimiter contains the closing flag.
1422 */
1423 public function has_closing_flag(): bool {
1424 return $this->has_closing_flag;
1425 }
1426
1427 /**
1428 * Indicates if the block delimiter represents a block of the given type.
1429 *
1430 * Since the “core” namespace may be implicit, it’s allowable to pass
1431 * either the fully-qualified block type with namespace and block name
1432 * or the shorthand version containing only the block name, if
1433 * the desired block is in the “core” namespace.
1434 *
1435 * Since freeform HTML content is non-block content, it has no block type.
1436 * Passing the wildcard “*” will, however, return true for all block types,
1437 * even the implicit freeform content, though not for spans of inner HTML.
1438 *
1439 * Example:
1440 *
1441 * $is_core_paragraph = $processor->is_block_type( 'paragraph' );
1442 * $is_core_paragraph = $processor->is_block_type( 'core/paragraph' );
1443 * $is_formula = $processor->is_block_type( 'math-blocks/formula' );
1444 *
1445 * @param string $block_type Block type name for the desired block.
1446 * E.g. "paragraph", "core/paragraph", "math-blocks/formula".
1447 * @return bool Whether this delimiter represents a block of the given type.
1448 */
1449 public function is_block_type( string $block_type ): bool {
1450 if ( '*' === $block_type ) {
1451 return true;
1452 }
1453
1454 if ( $this->is_html() ) {
1455 // This is a core/freeform text block, it’s special.
1456 if ( 0 === ( $this->open_blocks_length[0] ?? null ) ) {
1457 return (
1458 'core/freeform' === $block_type ||
1459 'freeform' === $block_type
1460 );
1461 }
1462
1463 // Otherwise this is innerHTML and not a block.
1464 return false;
1465 }
1466
1467 return $this->are_equal_block_types( $this->source_text, $this->namespace_at, $this->name_at - $this->namespace_at + $this->name_length, $block_type, 0, strlen( $block_type ) );
1468 }
1469
1470 /**
1471 * Given two spans of text, indicate if they represent identical block types.
1472 *
1473 * This function normalizes block types to account for implicit core namespacing.
1474 *
1475 * Note! This function only returns valid results when the complete block types are
1476 * represented in the span offsets and lengths. This means that the full optional
1477 * namespace and block name must be represented in the input arguments.
1478 *
1479 * Example:
1480 *
1481 * 0 5 10 15 20 25 30 35 40
1482 * $text = '<!-- wp:block --><!-- /wp:core/block -->';
1483 *
1484 * true === WP_Block_Processor::are_equal_block_types( $text, 9, 5, $text, 27, 10 );
1485 * false === WP_Block_Processor::are_equal_block_types( $text, 9, 5, 'my/block', 0, 8 );
1486 *
1487 * @since 6.9.0
1488 *
1489 * @param string $a_text Text in which first block type appears.
1490 * @param int $a_at Byte offset into text in which first block type starts.
1491 * @param int $a_length Byte length of first block type.
1492 * @param string $b_text Text in which second block type appears (may be the same as the first text).
1493 * @param int $b_at Byte offset into text in which second block type starts.
1494 * @param int $b_length Byte length of second block type.
1495 * @return bool Whether the spans of text represent identical block types, normalized for namespacing.
1496 */
1497 public static function are_equal_block_types( string $a_text, int $a_at, int $a_length, string $b_text, int $b_at, int $b_length ): bool {
1498 $a_ns_length = strcspn( $a_text, '/', $a_at, $a_length );
1499 $b_ns_length = strcspn( $b_text, '/', $b_at, $b_length );
1500
1501 $a_has_ns = $a_ns_length !== $a_length;
1502 $b_has_ns = $b_ns_length !== $b_length;
1503
1504 // Both contain namespaces.
1505 if ( $a_has_ns && $b_has_ns ) {
1506 if ( $a_length !== $b_length ) {
1507 return false;
1508 }
1509
1510 $a_block_type = substr( $a_text, $a_at, $a_length );
1511
1512 return 0 === substr_compare( $b_text, $a_block_type, $b_at, $b_length );
1513 }
1514
1515 if ( $a_has_ns ) {
1516 $b_block_type = 'core/' . substr( $b_text, $b_at, $b_length );
1517
1518 return (
1519 strlen( $b_block_type ) === $a_length &&
1520 0 === substr_compare( $a_text, $b_block_type, $a_at, $a_length )
1521 );
1522 }
1523
1524 if ( $b_has_ns ) {
1525 $a_block_type = 'core/' . substr( $a_text, $a_at, $a_length );
1526
1527 return (
1528 strlen( $a_block_type ) === $b_length &&
1529 0 === substr_compare( $b_text, $a_block_type, $b_at, $b_length )
1530 );
1531 }
1532
1533 // Neither contains a namespace.
1534 if ( $a_length !== $b_length ) {
1535 return false;
1536 }
1537
1538 $a_name = substr( $a_text, $a_at, $a_length );
1539
1540 return 0 === substr_compare( $b_text, $a_name, $b_at, $b_length );
1541 }
1542
1543 /**
1544 * Indicates if the matched delimiter is an opening or void delimiter of the given type,
1545 * if a type is provided, otherwise if it opens any block or implicit freeform HTML content.
1546 *
1547 * This is a helper method to ease handling of code inspecting where blocks start, and for
1548 * checking if the blocks are of a given type. The function is variadic to allow for
1549 * checking if the delimiter opens one of many possible block types.
1550 *
1551 * To advance to the start of a block {@see self::next_block()}.
1552 *
1553 * Example:
1554 *
1555 * $processor = new WP_Block_Processor( $html );
1556 * while ( $processor->next_delimiter() ) {
1557 * if ( $processor->opens_block( 'core/code', 'syntaxhighlighter/code' ) ) {
1558 * echo "Found code!";
1559 * continue;
1560 * }
1561 *
1562 * if ( $processor->opens_block( 'core/image' ) ) {
1563 * echo "Found an image!";
1564 * continue;
1565 * }
1566 *
1567 * if ( $processor->opens_block() ) {
1568 * echo "Found a new block!";
1569 * }
1570 * }
1571 *
1572 * @since 6.9.0
1573 *
1574 * @see self::is_block_type()
1575 *
1576 * @param string[] $block_type Optional. Is the matched block type one of these?
1577 * If none are provided, will not test block type.
1578 * @return bool Whether the matched block delimiter opens a block, and whether it
1579 * opens a block of one of the given block types, if provided.
1580 */
1581 public function opens_block( string ...$block_type ): bool {
1582 // HTML spans only open implicit freeform content at the top level.
1583 if ( self::HTML_SPAN === $this->state && 1 !== count( $this->open_blocks_at ) ) {
1584 return false;
1585 }
1586
1587 /*
1588 * Because HTML spans are discovered after the next delimiter is found,
1589 * the delimiter type when visiting HTML spans refers to the type of the
1590 * following delimiter. Therefore the HTML case is handled by checking
1591 * the state and depth of the stack of open blocks.
1592 */
1593 if ( self::CLOSER === $this->type && ! $this->is_html() ) {
1594 return false;
1595 }
1596
1597 if ( count( $block_type ) === 0 ) {
1598 return true;
1599 }
1600
1601 foreach ( $block_type as $block ) {
1602 if ( $this->is_block_type( $block ) ) {
1603 return true;
1604 }
1605 }
1606
1607 return false;
1608 }
1609
1610 /**
1611 * Indicates if the matched delimiter is an HTML span.
1612 *
1613 * @since 6.9.0
1614 *
1615 * @see self::is_non_whitespace_html()
1616 *
1617 * @return bool Whether the processor is matched on an HTML span.
1618 */
1619 public function is_html(): bool {
1620 return self::HTML_SPAN === $this->state;
1621 }
1622
1623 /**
1624 * Indicates if the matched delimiter is an HTML span and comprises more
1625 * than whitespace characters, i.e. contains real content.
1626 *
1627 * Many block serializers introduce newlines between block delimiters,
1628 * so the presence of top-level non-block content does not imply that
1629 * there are “real” freeform HTML blocks. Checking if there is content
1630 * beyond whitespace is a more certain check, such as for determining
1631 * whether to load CSS for the freeform or fallback block type.
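 *
 * Example (a sketch where only newlines separate two void blocks):
 *
 *     $processor = new WP_Block_Processor( "<!-- wp:a /-->\n\n<!-- wp:b /-->" );
 *     $processor->next_token(); // Matched on the void block “a”.
 *     $processor->next_token(); // Matched on the newlines between the delimiters.
 *     true === $processor->is_html();
 *     false === $processor->is_non_whitespace_html();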
1632 *
1633 * @since 6.9.0
1634 *
1635 * @see self::is_html()
1636 *
1637 * @return bool Whether the currently-matched delimiter is an HTML
1638 * span containing non-whitespace text.
1639 */
1640 public function is_non_whitespace_html(): bool {
1641 if ( ! $this->is_html() ) {
1642 return false;
1643 }
1644
1645 $length = $this->matched_delimiter_at - $this->after_previous_delimiter;
1646
1647 $whitespace_length = strspn(
1648 $this->source_text,
1649 " \t\f\r\n",
1650 $this->after_previous_delimiter,
1651 $length
1652 );
1653
1654 return $whitespace_length !== $length;
1655 }
1656
1657 /**
1658 * Returns the string content of a matched HTML span, or `null` otherwise.
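 *
 * Example (a sketch reusing the inner-HTML document from the
 * {@see self::get_breadcrumbs()} examples):
 *
 *     $processor = new WP_Block_Processor( '<!-- wp:a -->Inner HTML<!-- /wp:a -->' );
 *     $processor->next_token();
 *     // Matched on the opening delimiter, not on an HTML span.
 *     null === $processor->get_html_content();
 *     $processor->next_token();
 *     'Inner HTML' === $processor->get_html_content();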
1659 *
1660 * @since 6.9.0
1661 *
1662 * @return string|null Raw HTML content, or `null` if not currently matched on HTML.
1663 */
1664 public function get_html_content(): ?string {
1665 if ( ! $this->is_html() ) {
1666 return null;
1667 }
1668
1669 return substr(
1670 $this->source_text,
1671 $this->after_previous_delimiter,
1672 $this->matched_delimiter_at - $this->after_previous_delimiter
1673 );
1674 }
1675
1676 /**
1677 * Allocates a substring for the block type and returns the fully-qualified
1678 * name, including the namespace, if matched on a delimiter, otherwise `null`.
1679 *
1680 * This function is like {@see self::get_printable_block_type()} but when
1681 * paused on a freeform HTML block, will return `null` instead of “core/freeform”.
1682 * The `null` behavior matches what {@see \parse_blocks()} returns but may not
1683 * be as useful as having a string value.
1684 *
1685 * This function allocates a substring for the given block type. This
1686 * allocation will be small and likely fine in most cases, but it's
1687 * preferable to call {@see self::is_block_type()} if only needing
1688 * to know whether the delimiter is for a given block type, as that
1689 * function is more efficient for this purpose and avoids the allocation.
1690 *
1691 * Example:
1692 *
1693 * // Avoid.
1694 * 'core/paragraph' === $processor->get_block_type();
1695 *
1696 * // Prefer.
1697 * $processor->is_block_type( 'core/paragraph' );
1698 * $processor->is_block_type( 'paragraph' );
1699 * $processor->is_block_type( 'core/freeform' );
1700 *
1701 * // Freeform HTML content has no block type.
1702 * $processor = new WP_Block_Processor( 'non-block content' );
1703 * $processor->next_token();
1704 * null === $processor->get_block_type();
1705 *
1706 * @since 6.9.0
1707 *
1708 * @see self::are_equal_block_types()
1709 *
1710 * @return string|null Fully-qualified block namespace and type, e.g. "core/paragraph",
1711 * if matched on an explicit delimiter, otherwise `null`.
1712 */
1713 public function get_block_type(): ?string {
1714 if (
1715 self::READY === $this->state ||
1716 self::COMPLETE === $this->state ||
1717 self::INCOMPLETE_INPUT === $this->state
1718 ) {
1719 return null;
1720 }
1721
1722 // This is a core/freeform text block, it’s special.
1723 if ( $this->is_html() ) {
1724 return null;
1725 }
1726
1727 $block_type = substr( $this->source_text, $this->namespace_at, $this->name_at - $this->namespace_at + $this->name_length );
1728 return self::normalize_block_type( $block_type );
1729 }
1730
1731 /**
1732 * Allocates a printable substring for the block type and returns the fully-qualified
1733 * name, including the namespace, if matched on a delimiter or freeform block, otherwise `null`.
1734 *
1735 * This function is like {@see self::get_block_type()} but when paused on a freeform
1736 * HTML block, will return “core/freeform” instead of `null`. The `null` behavior matches
1737 * what {@see \parse_blocks()} returns but may not be as useful as having a string value.
1738 *
1739 * This function allocates a substring for the given block type. This
1740 * allocation will be small and likely fine in most cases, but it's
1741 * preferable to call {@see self::is_block_type()} if only needing
1742 * to know whether the delimiter is for a given block type, as that
1743 * function is more efficient for this purpose and avoids the allocation.
1744 *
1745 * Example:
1746 *
1747 * // Avoid.
1748 * 'core/paragraph' === $processor->get_printable_block_type();
1749 *
1750 * // Prefer.
1751 * $processor->is_block_type( 'core/paragraph' );
1752 * $processor->is_block_type( 'paragraph' );
1753 * $processor->is_block_type( 'core/freeform' );
1754 *
1755 * // Freeform HTML content is given an implicit type.
1756 * $processor = new WP_Block_Processor( 'non-block content' );
1757 * $processor->next_token();
1758 * 'core/freeform' === $processor->get_printable_block_type();
1759 *
1760 * @since 6.9.0
1761 *
1762 * @see self::are_equal_block_types()
1763 *
1764 * @return string|null Fully-qualified block namespace and type, e.g. "core/paragraph",
1765 * if matched on an explicit delimiter or freeform block, otherwise `null`.
1766 */
	public function get_printable_block_type(): ?string {
		if (
			self::READY === $this->state ||
			self::COMPLETE === $this->state ||
			self::INCOMPLETE_INPUT === $this->state
		) {
			return null;
		}

		// This is a core/freeform text block, it’s special.
		if ( $this->is_html() ) {
			return 1 === count( $this->open_blocks_at )
				? 'core/freeform'
				: '#innerHTML';
		}

		$block_type = substr( $this->source_text, $this->namespace_at, $this->name_at - $this->namespace_at + $this->name_length );
		return self::normalize_block_type( $block_type );
	}

	/**
	 * Normalizes a block name to ensure that missing implicit “core” namespaces are present.
	 *
	 * Example:
	 *
	 *     'core/paragraph' === WP_Block_Processor::normalize_block_type( 'paragraph' );
	 *     'core/paragraph' === WP_Block_Processor::normalize_block_type( 'core/paragraph' );
	 *     'my/paragraph' === WP_Block_Processor::normalize_block_type( 'my/paragraph' );
	 *
	 * @since 6.9.0
	 *
	 * @param string $block_type Valid block name, potentially without a namespace.
	 * @return string Fully-qualified block type including namespace.
	 */
	public static function normalize_block_type( string $block_type ): string {
		return false === strpos( $block_type, '/' )
			? "core/{$block_type}"
			: $block_type;
	}

	/**
	 * Returns a lazy wrapper around the block attributes, which can be used
	 * for efficiently interacting with the JSON attributes.
	 *
	 * This stub hints that there should be a lazy interface for parsing
	 * block attributes but doesn’t define it. It serves both as a placeholder
	 * for one to come as well as a guard against implementing an eager
	 * function in its place.
	 *
	 * @throws Exception This function is a stub for subclasses to implement
	 *                   when providing streaming attribute parsing.
	 *
	 * @since 6.9.0
	 *
	 * @see self::allocate_and_return_parsed_attributes()
	 *
	 * @return never
	 */
	public function get_attributes() {
		throw new Exception( 'Lazy attribute parsing not yet supported' );
	}

	/**
	 * Attempts to parse and return the entire JSON attributes from the delimiter,
	 * allocating memory and processing the JSON span in the process.
	 *
	 * This does not return any parsed attributes for a closing block delimiter
	 * even if there is a span of JSON content; JSON in a closer is a parsing error.
	 *
	 * Consider calling {@see static::get_attributes()} instead if it's not
	 * necessary to read all the attributes at the same time, as that is intended
	 * to provide a more efficient mechanism for typical use cases. Note that it
	 * is currently a stub which throws until a lazy interface is implemented.
	 *
	 * Since the JSON span inside the comment delimiter may not be valid JSON,
	 * this function will return `null` if it cannot parse the span, and
	 * {@see static::get_last_json_error()} will report the appropriate `JSON_ERROR_` constant.
	 *
	 * If the delimiter contains no JSON span, it will also return `null`,
	 * but the last error will be set to {@see \JSON_ERROR_NONE}.
	 *
	 * Example:
	 *
	 *     $processor = new WP_Block_Processor( '<!-- wp:image {"url": "https://wordpress.org/favicon.ico"} -->' );
	 *     $processor->next_delimiter();
	 *     $memory_hungry_and_slow_attributes = $processor->allocate_and_return_parsed_attributes();
	 *     $memory_hungry_and_slow_attributes === array( 'url' => 'https://wordpress.org/favicon.ico' );
	 *
	 *     $processor = new WP_Block_Processor( '<!-- /wp:image {"url": "https://wordpress.org/favicon.ico"} -->' );
	 *     $processor->next_delimiter();
	 *     null === $processor->allocate_and_return_parsed_attributes();
	 *     JSON_ERROR_NONE === $processor->get_last_json_error();
	 *
	 *     $processor = new WP_Block_Processor( '<!-- wp:separator {} /-->' );
	 *     $processor->next_delimiter();
	 *     array() === $processor->allocate_and_return_parsed_attributes();
	 *
	 *     $processor = new WP_Block_Processor( '<!-- wp:separator /-->' );
	 *     $processor->next_delimiter();
	 *     null === $processor->allocate_and_return_parsed_attributes();
	 *
	 *     $processor = new WP_Block_Processor( '<!-- wp:image {"url} -->' );
	 *     $processor->next_delimiter();
	 *     null === $processor->allocate_and_return_parsed_attributes();
	 *     JSON_ERROR_CTRL_CHAR === $processor->get_last_json_error();
	 *
	 * @since 6.9.0
	 *
	 * @return array|null Parsed JSON attributes, if present and valid, otherwise `null`.
	 */
	public function allocate_and_return_parsed_attributes(): ?array {
		$this->last_json_error = JSON_ERROR_NONE;

		if ( self::CLOSER === $this->type || $this->is_html() || 0 === $this->json_length ) {
			return null;
		}

		$json_span = substr( $this->source_text, $this->json_at, $this->json_length );
		$parsed    = json_decode( $json_span, null, 512, JSON_OBJECT_AS_ARRAY | JSON_INVALID_UTF8_SUBSTITUTE );

		$last_error            = json_last_error();
		$this->last_json_error = $last_error;

		return ( JSON_ERROR_NONE === $last_error && is_array( $parsed ) )
			? $parsed
			: null;
	}

	/**
	 * Returns the span of the currently-matched delimiter or HTML span, if matched, otherwise `null`.
	 *
	 * Example:
	 *
	 *     $processor = new WP_Block_Processor( '<!-- wp:void /-->' );
	 *     null === $processor->get_span();
	 *
	 *     $processor->next_delimiter();
	 *     new WP_HTML_Span( 0, 17 ) == $processor->get_span();
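	 *
	 *     // A minimal sketch of reading the matched text back out of the source.
	 *     // Assumes the (hypothetical) variable `$html` holds the same string that
	 *     // was given to the processor, and relies on the public `start` and
	 *     // `length` properties of `WP_HTML_Span`.
	 *     $html      = '<!-- wp:void /-->';
	 *     $processor = new WP_Block_Processor( $html );
	 *     $processor->next_delimiter();
	 *     $span = $processor->get_span();
	 *     '<!-- wp:void /-->' === substr( $html, $span->start, $span->length );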
	 *
	 * @since 6.9.0
	 *
	 * @return WP_HTML_Span|null Span of source text covering the matched delimiter
	 *                           or HTML span, otherwise `null`.
	 */
	public function get_span(): ?WP_HTML_Span {
		switch ( $this->state ) {
			case self::HTML_SPAN:
				return new WP_HTML_Span( $this->after_previous_delimiter, $this->matched_delimiter_at - $this->after_previous_delimiter );

			case self::MATCHED:
				return new WP_HTML_Span( $this->matched_delimiter_at, $this->matched_delimiter_length );

			default:
				return null;
		}
	}

	//
	// Constant declarations that would otherwise pollute the top of the class.
	//

	/**
	 * Indicates that the block comment delimiter closes an open block.
	 *
	 * @see self::$type
	 *
	 * @since 6.9.0
	 */
	const CLOSER = 'closer';

	/**
	 * Indicates that the block comment delimiter opens a block.
	 *
	 * @see self::$type
	 *
	 * @since 6.9.0
	 */
	const OPENER = 'opener';

	/**
	 * Indicates that the block comment delimiter represents a void block
	 * with no inner content of any kind.
	 *
	 * @see self::$type
	 *
	 * @since 6.9.0
	 */
	const VOID = 'void';

	/**
	 * Indicates that the processor is ready to start parsing but hasn’t yet begun.
	 *
	 * @see self::$state
	 *
	 * @since 6.9.0
	 */
	const READY = 'processor-ready';

	/**
	 * Indicates that the processor is matched on an explicit block delimiter.
	 *
	 * @see self::$state
	 *
	 * @since 6.9.0
	 */
	const MATCHED = 'processor-matched';

	/**
	 * Indicates that the processor is matched on the opening of an implicit freeform delimiter.
	 *
	 * @see self::$state
	 *
	 * @since 6.9.0
	 */
	const HTML_SPAN = 'processor-html-span';

	/**
	 * Indicates that the parser started parsing a block comment delimiter, but
	 * the input document ended before it could finish. The document was likely truncated.
	 *
	 * @see self::$state
	 *
	 * @since 6.9.0
	 */
	const INCOMPLETE_INPUT = 'incomplete-input';

	/**
	 * Indicates that the processor has finished parsing and has nothing left to scan.
	 *
	 * @see self::$state
	 *
	 * @since 6.9.0
	 */
	const COMPLETE = 'processor-complete';
}