t: add skeleton chainlint.pl

Although chainlint.sed usefully identifies broken &&-chains in tests, it
has several shortcomings which include:

* only detects &&-chain breakage in subshells (one-level deep)

* does not check for broken top-level &&-chains; that task is left to
  the "magic exit code 117" checker built into test-lib.sh, however,
  that detection does not extend to `{...}` blocks, `$(...)`
  expressions, or compound statements such as `if...fi`,
  `while...done`, `case...esac`

* uses heuristics, which makes it (potentially) fallible and difficult
  to tweak to handle additional real-world cases

* written in `sed` and employs advanced `sed` operators which are
  probably not well-known to many programmers, thus the pool of people
  who can maintain it is likely small

* manually simulates recursion into subshells which makes it much more
  difficult to reason about than, say, a traditional top-down parser

* checks each test as the test is run, which can get expensive for
  tests which are run repeatedly by functions or loops since their
  bodies will be checked over and over (tens or hundreds of times)
  unnecessarily

To address these shortcomings, begin implementing a more functional and
precise test linter which understands shell syntax and semantics rather
than employing heuristics, thus is able to recognize structural problems
with tests beyond broken &&-chains.

The new linter is written in Perl, thus should be more accessible to a
wider audience, and is structured as a traditional top-down parser which
makes it much easier to reason about, and allows it to inspect compound
statements within test bodies to any depth.

Furthermore, it can check all test definitions in the entire project in
a single invocation rather than having to be invoked once per test, and
each test definition is checked only once no matter how many times the
test is actually run.

At this stage, the new linter is just a skeleton containing boilerplate
which handles command-line options, collects and reports statistics, and
feeds its arguments -- paths of test scripts -- to a (presently)
do-nothing script parser for validation. Subsequent changes will flesh
out the functionality.

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-01 02:29:39 +02:00
#!/usr/bin/env perl
#
# Copyright (c) 2021-2022 Eric Sunshine <sunshine@sunshineco.com>
#
# This tool scans shell scripts for test definitions and checks those tests for
# problems, such as broken &&-chains, which might hide bugs in the tests
# themselves or in behaviors being exercised by the tests.
#
# Input arguments are pathnames of shell scripts containing test definitions,
# or globs referencing a collection of scripts. For each problem discovered,
# the pathname of the script containing the test is printed along with the test
# name and the test body with a `?!FOO?!` annotation at the location of each
# detected problem, where "FOO" is a tag such as "AMP" which indicates a broken
# &&-chain. Returns zero if no problems are discovered, otherwise non-zero.

use warnings;
use strict;
2022-09-01 02:29:44 +02:00
use Config;
use File::Glob;
use Getopt::Long;

my $jobs = -1;
my $show_stats;
my $emit_all;
2022-09-01 02:29:40 +02:00
# Lexer tokenizes POSIX shell scripts. It is roughly modeled after section 2.3
# "Token Recognition" of POSIX chapter 2 "Shell Command Language". Although
# similar to lexical analyzers for other languages, this one differs in a few
# substantial ways due to quirks of the shell command language.
#
# For instance, in many languages, newline is just whitespace like space or
# TAB, but in shell a newline is a command separator, thus a distinct lexical
# token. A newline is significant and returned as a distinct token even at the
# end of a shell comment.
#
# In other languages, `1+2` would typically be scanned as three tokens
# (`1`, `+`, and `2`), but in shell it is a single token. However, the similar
# `1 + 2`, which embeds whitespace, is scanned as three tokens in shell, as well.
# In shell, several characters with special meaning lose that meaning when not
# surrounded by whitespace. For instance, the negation operator `!` is special
# when standing alone surrounded by whitespace; whereas in `foo!uucp` it is
# just a plain character in the longer token "foo!uucp". In many other
# languages, `"string"/foo:'string'` might be scanned as five tokens ("string",
# `/`, `foo`, `:`, and 'string'), but in shell, it is just a single token.
#
# The lexical analyzer for the shell command language is also somewhat unusual
# in that it recursively invokes the parser to handle the body of `$(...)`
# expressions which can contain arbitrary shell code. Such expressions may be
# encountered both inside and outside of double-quoted strings.
#
# The lexical analyzer is responsible for consuming shell here-doc bodies which
# extend from the line following a `<<TAG` operator until a line consisting
# solely of `TAG`. Here-doc consumption begins when a newline is encountered.
# It is legal for multiple here-doc `<<TAG` operators to be present on a single
# line, in which case their bodies must be present one following the next, and
# are consumed in the (left-to-right) order the `<<TAG` operators appear on the
# line. A special complication is that the bodies of all here-docs must be
# consumed when the newline is encountered even if the parse context depth has
# changed. For instance, in `cat <<A && x=$(cat <<B &&\n`, bodies of here-docs
# "A" and "B" must be consumed even though "A" was introduced outside the
# recursive parse context in which "B" was introduced and in which the newline
# is encountered.
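#
# Editorial illustration (a sketch, not part of the original commentary): it
# assumes a Lexer built with Lexer->new(undef, \$script), where passing undef
# for the parser is a shortcut that only works because the sample input has no
# `$(...)` expression, and it assumes scan_token() is called until it returns
# undef. Each token is a [lexeme, start, end] tuple in which `end` is the
# offset just past the lexeme:
#
#     my $script = "echo 1+2 &&\n";
#     my $lexer = Lexer->new(undef, \$script);
#     while (my $token = $lexer->scan_token()) {
#         my ($lexeme, $start, $end) = @$token;
#         ...
#     }
#
# With that input, the loop should see "echo" (0-4), "1+2" (5-8), "&&" (9-11),
# and the newline token (11-12).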
package Lexer;

sub new {
	my ($class, $parser, $s) = @_;
	bless {
		parser => $parser,
		buff => $s,
		heretags => []
	} => $class;
}

sub scan_heredoc_tag {
	my $self = shift @_;
	${$self->{buff}} =~ /\G(-?)/gc;
	my $indented = $1;
chainlint: latch start/end position of each token

When chainlint detects problems in a test, such as a broken &&-chain, it
prints out the test with "?!FOO?!" annotations inserted at each problem
location. However, rather than annotating the original test definition,
it instead dumps out a parsed token representation of the test. Since it
lacks comments, indentations, here-doc bodies, and so forth, this
tokenized representation can be difficult for the test author to digest
and relate back to the original test definition.

To address this shortcoming, an upcoming change will make it print out
an annotated copy of the original test definition rather than the
tokenized representation. In order to do so, it will need to know the
start and end positions of each token in the original test definition.

As preparation, upgrade TestParser::scan_token() to latch the start and
end position of the token being scanned, and return that information
along with the token itself. A subsequent change will take advantage of
this positional information.

In terms of implementation, TestParser::scan_token() is retrofitted to
return a tuple consisting of the token's lexeme and its start and end
positions, rather than returning just the lexeme. However, an
alternative would be to define a class which represents a token:

    package Token;

    sub new {
        my ($class, $lexeme, $start, $end) = @_;
        bless [$lexeme, $start, $end] => $class;
    }

    sub as_string {
        my $self = shift @_;
        return $self->[0];
    }

    sub compare {
        my ($x, $y) = @_;
        if (UNIVERSAL::isa($y, 'Token')) {
            return $x->[0] cmp $y->[0];
        }
        return $x->[0] cmp $y;
    }

    use overload (
        '""' => 'as_string',
        'cmp' => 'compare'
    );

The major benefit of the class-based approach is that it is entirely
non-invasive; it requires no additional changes to the rest of the
script since a Token converts automatically to a string, which is what
scan_token() historically returned.

The big downside to the Token approach, however, is that it is _slow_;
on this developer's (old) machine, it increases user-time by an
unacceptable seven seconds when scanning all test scripts in the
project. Hence, the simple tuple approach is employed instead since it
adds only a fraction of a second user-time.

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2022-11-08 20:08:29 +01:00
	my $token = $self->scan_token();
	return "<<$indented" unless $token;
	my $tag = $token->[0];
	$tag =~ s/['"\\]//g;
	push(@{$self->{heretags}}, $indented ? "\t$tag" : "$tag");
	return "<<$indented$tag";
}

sub scan_op {
	my ($self, $c) = @_;
	my $b = $self->{buff};
	return $c unless $$b =~ /\G(.)/sgc;
	my $cc = $c . $1;
	return scan_heredoc_tag($self) if $cc eq '<<';
	return $cc if $cc =~ /^(?:&&|\|\||>>|;;|<&|>&|<>|>\|)$/;
	pos($$b)--;
	return $c;
}

sub scan_sqstring {
	my $self = shift @_;
	${$self->{buff}} =~ /\G([^']*'|.*\z)/sgc;
	return "'" . $1;
}

sub scan_dqstring {
	my $self = shift @_;
	my $b = $self->{buff};
	my $s = '"';
	while (1) {
		# slurp up non-special characters
		$s .= $1 if $$b =~ /\G([^"\$\\]+)/gc;
		# handle special characters
		last unless $$b =~ /\G(.)/sgc;
		my $c = $1;
		$s .= '"', last if $c eq '"';
		$s .= '$' . $self->scan_dollar(), next if $c eq '$';
		if ($c eq '\\') {
			$s .= '\\', last unless $$b =~ /\G(.)/sgc;
			$c = $1;
			next if $c eq "\n"; # line splice
			# backslash escapes only $, `, ", \ in dq-string
			$s .= '\\' unless $c =~ /^[\$`"\\]$/;
			$s .= $c;
			next;
		}
		die("internal error scanning dq-string '$c'\n");
	}
	return $s;
}

sub scan_balanced {
	my ($self, $c1, $c2) = @_;
	my $b = $self->{buff};
	my $depth = 1;
	my $s = $c1;
	while ($$b =~ /\G([^\Q$c1$c2\E]*(?:[\Q$c1$c2\E]|\z))/gc) {
		$s .= $1;
		$depth++, next if $s =~ /\Q$c1\E$/;
		$depth--;
		last if $depth == 0;
	}
	return $s;
}
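
# Editorial illustration for scan_balanced() (a sketch, not from the original
# author): the sub expects the opening delimiter to have been consumed already
# and collects text until the nesting depth drops back to zero. Assuming the
# buffer position sits just after the first "(" of `$((1 + (2 * 3)))`, which is
# how scan_dollar() invokes it for arithmetic expansion:
#
#     $self->scan_balanced('(', ')')    # returns "((1 + (2 * 3)))"
#
# scan_token() prepends the "$", so the whole expression, embedded whitespace
# included, ends up as the single token `$((1 + (2 * 3)))`.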

sub scan_subst {
	my $self = shift @_;
	my @tokens = $self->{parser}->parse(qr/^\)$/);
	$self->{parser}->next_token(); # closing ")"
	return @tokens;
}

sub scan_dollar {
	my $self = shift @_;
	my $b = $self->{buff};
	return $self->scan_balanced('(', ')') if $$b =~ /\G\((?=\()/gc; # $((...))
	return '(' . join(' ', map {$_->[0]} $self->scan_subst()) . ')' if $$b =~ /\G\(/gc; # $(...)
	return $self->scan_balanced('{', '}') if $$b =~ /\G\{/gc; # ${...}
	return $1 if $$b =~ /\G(\w+)/gc; # $var
	return $1 if $$b =~ /\G([@*#?$!0-9-])/gc; # $*, $1, $$, etc.
	return '';
}

sub swallow_heredocs {
	my $self = shift @_;
	my $b = $self->{buff};
	my $tags = $self->{heretags};
	while (my $tag = shift @$tags) {
		my $indent = $tag =~ s/^\t// ? '\\s*' : '';
		$$b =~ /(?:\G|\n)$indent\Q$tag\E(?:\n|\z)/gc;
	}
}
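
# Editorial illustration for the here-doc handling above (a sketch under
# stated assumptions, not from the original author). Assume the lexer is
# scanning the string
#
#     "cat <<-EOF\n\tbody\n\tEOF\necho done\n"
#
# scan_heredoc_tag() records the tag as "\tEOF" (the leading TAB marks the
# `<<-` form) and returns the lexeme `<<-EOF`. When the newline ending the
# first line is scanned, swallow_heredocs() skips ahead past the terminating
# `EOF` line, so the token lexemes are expected to be `cat`, `<<-EOF`, "\n",
# `echo`, `done`, and "\n". The here-doc body never shows up as tokens.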

sub scan_token {
	my $self = shift @_;
	my $b = $self->{buff};
	my $token = '';
	my $start;
RESTART:
	$$b =~ /\G[ \t]+/gc; # skip whitespace (but not newline)
	$start = pos($$b) || 0;
	return ["\n", $start, pos($$b)] if $$b =~ /\G#[^\n]*(?:\n|\z)/gc; # comment
	while (1) {
		# slurp up non-special characters
		$token .= $1 if $$b =~ /\G([^\\;&|<>(){}'"\$\s]+)/gc;
		# handle special characters
		last unless $$b =~ /\G(.)/sgc;
		my $c = $1;
chainlint: tighten accuracy when consuming input stream

To extract the next token in the input stream, Lexer::scan_token() finds
the start of the token by skipping whitespace, then consumes characters
belonging to the token until it encounters a non-token character, such
as an operator, punctuation, or whitespace. In the case of an operator
or punctuation which ends a token, before returning the just-scanned
token, it pushes that operator or punctuation character back onto the
input stream to ensure that it will be the first character consumed by
the next call to scan_token().

However, scan_token() is intentionally lax when whitespace ends a token;
it doesn't bother pushing the whitespace character back onto the token
stream since it knows that the next call to scan_token() will, as its
first step, skip over whitespace anyhow when looking for the start of
the token.

Although such laxity is harmless for the proper functioning of the
lexical analyzer, it does make it difficult to precisely identify the
token's end position in the input stream. Accurate token position
information may be desirable, for instance, to annotate problems or
highlight other interesting facets of the input found during the parsing
phase. To accommodate such possibilities, tighten scan_token() by making
it push the token-ending whitespace character back onto the input
stream, just as it does for other token-ending characters.

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2022-11-08 20:08:28 +01:00
		pos($$b)--, last if $c =~ /^[ \t]$/; # whitespace ends token
		pos($$b)--, last if length($token) && $c =~ /^[;&|<>(){}\n]$/;
		$token .= $self->scan_sqstring(), next if $c eq "'";
		$token .= $self->scan_dqstring(), next if $c eq '"';
		$token .= $c . $self->scan_dollar(), next if $c eq '$';
		$self->swallow_heredocs(), $token = $c, last if $c eq "\n";
		$token = $self->scan_op($c), last if $c =~ /^[;&|<>]$/;
		$token = $c, last if $c =~ /^[(){}]$/;
		if ($c eq '\\') {
			$token .= '\\', last unless $$b =~ /\G(.)/sgc;
			$c = $1;
			next if $c eq "\n" && length($token); # line splice
			goto RESTART if $c eq "\n"; # line splice
			$token .= '\\' . $c;
			next;
		}
		die("internal error scanning character '$c'\n");
	}
	return length($token) ? [$token, $start, pos($$b)] : undef;
}
2022-09-01 02:29:41 +02:00
# ShellParser parses POSIX shell scripts (with minor extensions for Bash). It
# is a recursive descent parser very roughly modeled after section 2.10 "Shell
# Grammar" of POSIX chapter 2 "Shell Command Language".
package ShellParser;

sub new {
	my ($class, $s) = @_;
	my $self = bless {
		buff => [],
		stop => [],
		output => []
	} => $class;
	$self->{lexer} = Lexer->new($self, $s);
	return $self;
}

sub next_token {
	my $self = shift @_;
	return pop(@{$self->{buff}}) if @{$self->{buff}};
	return $self->{lexer}->scan_token();
}

sub untoken {
	my $self = shift @_;
	push(@{$self->{buff}}, @_);
}

sub peek {
	my $self = shift @_;
	my $token = $self->next_token();
	return undef unless defined($token);
	$self->untoken($token);
	return $token;
}

sub stop_at {
	my ($self, $token) = @_;
	return 1 unless defined($token);
	my $stop = ${$self->{stop}}[-1] if @{$self->{stop}};
|
chainlint: latch start/end position of each token

When chainlint detects problems in a test, such as a broken &&-chain, it
prints out the test with "?!FOO?!" annotations inserted at each problem
location. However, rather than annotating the original test definition,
it instead dumps out a parsed token representation of the test. Since it
lacks comments, indentations, here-doc bodies, and so forth, this
tokenized representation can be difficult for the test author to digest
and relate back to the original test definition.

To address this shortcoming, an upcoming change will make it print out
an annotated copy of the original test definition rather than the
tokenized representation. In order to do so, it will need to know the
start and end positions of each token in the original test definition.

As preparation, upgrade TestParser::scan_token() to latch the start and
end position of the token being scanned, and return that information
along with the token itself. A subsequent change will take advantage of
this positional information.

In terms of implementation, TestParser::scan_token() is retrofitted to
return a tuple consisting of the token's lexeme and its start and end
positions, rather than returning just the lexeme. However, an
alternative would be to define a class which represents a token:

    package Token;

    sub new {
        my ($class, $lexeme, $start, $end) = @_;
        bless [$lexeme, $start, $end] => $class;
    }

    sub as_string {
        my $self = shift @_;
        return $self->[0];
    }

    sub compare {
        my ($x, $y) = @_;
        if (UNIVERSAL::isa($y, 'Token')) {
            return $x->[0] cmp $y->[0];
        }
        return $x->[0] cmp $y;
    }

    use overload (
        '""' => 'as_string',
        'cmp' => 'compare'
    );

The major benefit of the class-based approach is that it is entirely
non-invasive; it requires no additional changes to the rest of the
script since a Token converts automatically to a string, which is what
scan_token() historically returned.

The big downside to the Token approach, however, is that it is _slow_;
on this developer's (old) machine, it increases user-time by an
unacceptable seven seconds when scanning all test scripts in the
project. Hence, the simple tuple approach is employed instead since it
adds only a fraction of a second user-time.

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2022-11-08 20:08:29 +01:00
	return defined($stop) && $token->[0] =~ $stop;
}
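
The tuple representation described in the commit message can be illustrated with a small standalone sketch. It is not chainlint's lexer: `scan_words` is a hypothetical helper that merely splits on whitespace, and it exists only to show what a `[lexeme, start, end]` tuple looks like and why callers read the lexeme as `$token->[0]`.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of chainlint.pl): split a string into
# whitespace-separated words, latching each word's start and end offsets.
sub scan_words {
	my $text = shift;
	my @tokens;
	while ($text =~ /(\S+)/g) {
		my $end = pos($text);          # offset just past the match
		my $start = $end - length($1); # offset of the first character
		push(@tokens, [$1, $start, $end]);
	}
	return @tokens;
}

my $body = "git add file &&\ngit commit";
for my $token (scan_words($body)) {
	my ($lexeme, $start, $end) = @$token;
	# the latched offsets recover the lexeme from the original text
	print "$lexeme\t$start\t$end\n"
		if substr($body, $start, $end - $start) eq $lexeme;
}
```

Because each token is a plain array reference, no operator overloading is involved, which is in line with the performance argument the commit message makes for tuples over a Token class.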

sub expect {
	my ($self, $expect) = @_;
	my $token = $self->next_token();
	return $token if defined($token) && $token->[0] eq $expect;
	push(@{$self->{output}}, "?!ERR?! expected '$expect' but found '" . (defined($token) ? $token->[0] : "<end-of-input>") . "'\n");
	$self->untoken($token) if defined($token);
	return ();
}

sub optional_newlines {
	my $self = shift @_;
	my @tokens;
	while (my $token = $self->peek()) {
		last unless $token->[0] eq "\n";
		push(@tokens, $self->next_token());
	}
	return @tokens;
}

sub parse_group {
	my $self = shift @_;
	return ($self->parse(qr/^}$/),
		$self->expect('}'));
}

sub parse_subshell {
	my $self = shift @_;
	return ($self->parse(qr/^\)$/),
		$self->expect(')'));
}
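
parse_group() and parse_subshell() both hand parse() an anchored pattern (`qr/^}$/`, `qr/^\)$/`) that names the lexeme at which parsing should stop. The sketch below shows only that idea, under stated assumptions: `take_until` is a hypothetical stand-in rather than chainlint's parse(), and the tokens are one-element tuples carrying just a lexeme.

```perl
use strict;
use warnings;

# Hypothetical helper: consume tuple tokens from the front of @$tokens
# until a lexeme matches $stop; the stop token itself is left in place.
sub take_until {
	my ($tokens, $stop) = @_;
	my @taken;
	while (@$tokens) {
		last if $tokens->[0][0] =~ $stop;
		push(@taken, shift @$tokens);
	}
	return @taken;
}

my @input = map { [$_] } ('echo', 'a)', ')', 'b');
my @body = take_until(\@input, qr/^\)$/);
print join(' ', map { $_->[0] } @body), "\n"; # prints "echo a)"
```

Anchoring with `^` and `$` matters here: the word `a)` contains a closing parenthesis but is not a stop token, so only the bare `)` lexeme ends the body, and it remains queued for a later expect(')')-style check.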

sub parse_case_pattern {
	my $self = shift @_;
	my @tokens;
	while (defined(my $token = $self->next_token())) {
		push(@tokens, $token);
		last if $token->[0] eq ')';
	}
	return @tokens;
}

sub parse_case {
	my $self = shift @_;
	my @tokens;
	push(@tokens,
	     $self->next_token(), # subject
	     $self->optional_newlines(),
	     $self->expect('in'),
	     $self->optional_newlines());
	while (1) {
		my $token = $self->peek();
		last unless defined($token) && $token->[0] ne 'esac';
		push(@tokens,
		     $self->parse_case_pattern(),
		     $self->optional_newlines(),
		     $self->parse(qr/^(?:;;|esac)$/)); # item body
		$token = $self->peek();
		last unless defined($token) && $token->[0] ne 'esac';
		push(@tokens,
		     $self->expect(';;'),
		     $self->optional_newlines());
	}
	push(@tokens, $self->expect('esac'));
	return @tokens;
}

sub parse_for {
	my $self = shift @_;
	my @tokens;
	push(@tokens,
	     $self->next_token(), # variable
	     $self->optional_newlines());
	my $token = $self->peek();
	if (defined($token) && $token->[0] eq 'in') {
		push(@tokens,
		     $self->expect('in'),
		     $self->optional_newlines());
	}
	push(@tokens,
	     $self->parse(qr/^do$/), # items
	     $self->expect('do'),
	     $self->optional_newlines(),
	     $self->parse_loop_body(),
	     $self->expect('done'));
	return @tokens;
}

sub parse_if {
	my $self = shift @_;
	my @tokens;
	while (1) {
		push(@tokens,
		     $self->parse(qr/^then$/), # if/elif condition
		     $self->expect('then'),
		     $self->optional_newlines(),
		     $self->parse(qr/^(?:elif|else|fi)$/)); # if/elif body
		my $token = $self->peek();
		last unless defined($token) && $token->[0] eq 'elif';
		push(@tokens, $self->expect('elif'));
	}
	my $token = $self->peek();
	if (defined($token) && $token->[0] eq 'else') {
		push(@tokens,
		     $self->expect('else'),
		     $self->optional_newlines(),
		     $self->parse(qr/^fi$/)); # else body
	}
	push(@tokens, $self->expect('fi'));
	return @tokens;
}

sub parse_loop_body {
	my $self = shift @_;
	return $self->parse(qr/^done$/);
}

sub parse_loop {
	my $self = shift @_;
	return ($self->parse(qr/^do$/), # condition
		$self->expect('do'),
		$self->optional_newlines(),
		$self->parse_loop_body(),
		$self->expect('done'));
}

sub parse_func {
	my $self = shift @_;
	return ($self->expect('('),
		$self->expect(')'),
		$self->optional_newlines(),
		$self->parse_cmd()); # body
}

sub parse_bash_array_assignment {
	my $self = shift @_;
	my @tokens = $self->expect('(');
	while (defined(my $token = $self->next_token())) {
		push(@tokens, $token);
		last if $token->[0] eq ')';
	}
	return @tokens;
}

my %compound = (
	'{' => \&parse_group,
	'(' => \&parse_subshell,
	'case' => \&parse_case,
	'for' => \&parse_for,
	'if' => \&parse_if,
	'until' => \&parse_loop,
	'while' => \&parse_loop);
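
The %compound table maps a leading keyword to the method that parses the rest of that construct, and the code reference is later invoked as a method via `$self->$f()`. A minimal standalone sketch of that dispatch pattern follows. The `Demo` package, its handlers, and the returned strings are hypothetical and stand in for the real parser only to show the mechanics.

```perl
use strict;
use warnings;

package Demo;

sub new { return bless {}, shift }
sub handle_if { return 'if-handler' }
sub handle_loop { return 'loop-handler' }

# keyword => handler; two keywords may share one handler, much as
# 'until' and 'while' share a single entry target in %compound
my %dispatch = (
	'if' => \&handle_if,
	'until' => \&handle_loop,
	'while' => \&handle_loop);

sub handle {
	my ($self, $word) = @_;
	if (my $f = $dispatch{$word}) {
		return $self->$f(); # code reference invoked as a method
	}
	return 'simple-command'; # no table entry: not a compound keyword
}

package main;

my $demo = Demo->new();
print $demo->handle($_), "\n" for ('if', 'until', 'echo');
```

Keeping the mapping in a hash means supporting another compound keyword is a one-line table change plus a handler, with no additional branch in the dispatching code.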
|
|
|
|
|
|
|
|
sub parse_cmd {
|
|
|
|
my $self = shift @_;
|
|
|
|
my $cmd = $self->next_token();
|
|
|
|
return () unless defined($cmd);
|
chainlint: latch start/end position of each token
When chainlint detects problems in a test, such as a broken &&-chain, it
prints out the test with "?!FOO?!" annotations inserted at each problem
location. However, rather than annotating the original test definition,
it instead dumps out a parsed token representation of the test. Since it
lacks comments, indentations, here-doc bodies, and so forth, this
tokenized representation can be difficult for the test author to digest
and relate back to the original test definition.
To address this shortcoming, an upcoming change will make it print out
an annotated copy of the original test definition rather than the
tokenized representation. In order to do so, it will need to know the
start and end positions of each token in the original test definition.
As preparation, upgrade TestParser::scan_token() to latch the start and
end position of the token being scanned, and return that information
along with the token itself. A subsequent change will take advantage of
this positional information.
In terms of implementation, TestParser::scan_token() is retrofitted to
return a tuple consisting of the token's lexeme and its start and end
positions, rather than returning just the lexeme. However, an
alternative would be to define a class which represents a token:
package Token;
sub new {
my ($class, $lexeme, $start, $end) = @_;
bless [$lexeme, $start, $end] => $class;
}
sub as_string {
my $self = shift @_;
return $self->[0];
}
sub compare {
my ($x, $y) = @_;
if (UNIVERSAL::isa($y, 'Token')) {
return $x->[0] cmp $y->[0];
}
return $x->[0] cmp $y;
}
use overload (
'""' => 'as_string',
'cmp' => 'compare'
);
The major benefit of the class-based approach is that it is entirely
non-invasive; it requires no additional changes to the rest of the
script since a Token converts automatically to a string, which is what
scan_token() historically returned.
The big downside to the Token approach, however, is that it is _slow_;
on this developer's (old) machine, it increases user-time by an
unacceptable seven seconds when scanning all test scripts in the
project. Hence, the simple tuple approach is employed instead since it
adds only a fraction of a second user-time.
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2022-11-08 20:08:29 +01:00
|
|
|
return $cmd if $cmd->[0] eq "\n";
|
2022-09-01 02:29:41 +02:00
|
|
|
|
|
|
|
my $token;
|
|
|
|
my @tokens = $cmd;
|
chainlint: latch start/end position of each token
When chainlint detects problems in a test, such as a broken &&-chain, it
prints out the test with "?!FOO?!" annotations inserted at each problem
location. However, rather than annotating the original test definition,
it instead dumps out a parsed token representation of the test. Since it
lacks comments, indentations, here-doc bodies, and so forth, this
tokenized representation can be difficult for the test author to digest
and relate back to the original test definition.
To address this shortcoming, an upcoming change will make it print out
an annotated copy of the original test definition rather than the
tokenized representation. In order to do so, it will need to know the
start and end positions of each token in the original test definition.
As preparation, upgrade TestParser::scan_token() to latch the start and
end position of the token being scanned, and return that information
along with the token itself. A subsequent change will take advantage of
this positional information.
In terms of implementation, TestParser::scan_token() is retrofitted to
return a tuple consisting of the token's lexeme and its start and end
positions, rather than returning just the lexeme. However, an
alternative would be to define a class which represents a token:
package Token;
sub new {
my ($class, $lexeme, $start, $end) = @_;
bless [$lexeme, $start, $end] => $class;
}
sub as_string {
my $self = shift @_;
return $self->[0];
}
sub compare {
my ($x, $y) = @_;
if (UNIVERSAL::isa($y, 'Token')) {
return $x->[0] cmp $y->[0];
}
return $x->[0] cmp $y;
}
use overload (
'""' => 'as_string',
'cmp' => 'compare'
);
The major benefit of the class-based approach is that it is entirely
non-invasive; it requires no additional changes to the rest of the
script since a Token converts automatically to a string, which is what
scan_token() historically returned.
The big downside to the Token approach, however, is that it is _slow_;
on this developer's (old) machine, it increases user-time by an
unacceptable seven seconds when scanning all test scripts in the
project. Hence, the simple tuple approach is employed instead since it
adds only a fraction of a second user-time.
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2022-11-08 20:08:29 +01:00
|
|
|
if ($cmd->[0] eq '!') {
|
2022-09-01 02:29:41 +02:00
|
|
|
push(@tokens, $self->parse_cmd());
|
|
|
|
return @tokens;
|
chainlint: latch start/end position of each token
When chainlint detects problems in a test, such as a broken &&-chain, it
prints out the test with "?!FOO?!" annotations inserted at each problem
location. However, rather than annotating the original test definition,
it instead dumps out a parsed token representation of the test. Since it
lacks comments, indentations, here-doc bodies, and so forth, this
tokenized representation can be difficult for the test author to digest
and relate back to the original test definition.
To address this shortcoming, an upcoming change will make it print out
an annotated copy of the original test definition rather than the
tokenized representation. In order to do so, it will need to know the
start and end positions of each token in the original test definition.
As preparation, upgrade TestParser::scan_token() to latch the start and
end position of the token being scanned, and return that information
along with the token itself. A subsequent change will take advantage of
this positional information.
In terms of implementation, TestParser::scan_token() is retrofitted to
return a tuple consisting of the token's lexeme and its start and end
positions, rather than returning just the lexeme. However, an
alternative would be to define a class which represents a token:
    package Token;
    sub new {
        my ($class, $lexeme, $start, $end) = @_;
        bless [$lexeme, $start, $end] => $class;
    }
    sub as_string {
        my $self = shift @_;
        return $self->[0];
    }
    sub compare {
        my ($x, $y) = @_;
        if (UNIVERSAL::isa($y, 'Token')) {
            return $x->[0] cmp $y->[0];
        }
        return $x->[0] cmp $y;
    }
    use overload (
        '""' => 'as_string',
        'cmp' => 'compare'
    );
The major benefit of the class-based approach is that it is entirely
non-invasive; it requires no additional changes to the rest of the
script since a Token converts automatically to a string, which is what
scan_token() historically returned.
The big downside to the Token approach, however, is that it is _slow_;
on this developer's (old) machine, it increases user-time by an
unacceptable seven seconds when scanning all test scripts in the
project. Hence, the simple tuple approach is employed instead since it
adds only a fraction of a second user-time.
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
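For comparison with the Token class above, the tuple approach can be sketched roughly as follows. This is only a minimal illustration, not the actual scan_token() implementation: the helper make_token(), the sample input, and the convention that the end position is exclusive are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: build a token as a plain [lexeme, start, end] tuple.
sub make_token {
	my ($lexeme, $start, $end) = @_;
	return [$lexeme, $start, $end];
}

# Hand-tokenized "cd data"; positions are character offsets into $body and
# the end offset is assumed to be exclusive.
my $body = 'cd data';
my @tokens = (make_token('cd', 0, 2), make_token('data', 3, 7));

# Callers must index the tuple explicitly ($token->[0]) rather than
# treating the token as a string, which is why the change is invasive.
for my $token (@tokens) {
	my ($lexeme, $start, $end) = @$token;
	print "$lexeme [$start,$end): ", substr($body, $start, $end - $start), "\n";
}
```

The trade-off described in the message is visible here: every consumer has to be updated to dereference the tuple, but no operator overloading is involved when lexemes are compared.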
2022-11-08 20:08:29 +01:00
	} elsif (my $f = $compound{$cmd->[0]}) {
2022-09-01 02:29:41 +02:00
		push(@tokens, $self->$f());
2022-11-08 20:08:29 +01:00
	} elsif (defined($token = $self->peek()) && $token->[0] eq '(') {
		if ($cmd->[0] !~ /\w=$/) {
2022-09-01 02:29:41 +02:00
			push(@tokens, $self->parse_func());
			return @tokens;
		}
2022-11-08 20:08:29 +01:00
		my @array = $self->parse_bash_array_assignment();
		$tokens[-1]->[0] .= join(' ', map {$_->[0]} @array);
		$tokens[-1]->[2] = $array[$#array][2] if @array;
2022-09-01 02:29:41 +02:00
	}

	while (defined(my $token = $self->next_token())) {
		$self->untoken($token), last if $self->stop_at($token);
		push(@tokens, $token);
2022-11-08 20:08:29 +01:00
		last if $token->[0] =~ /^(?:[;&\n|]|&&|\|\|)$/;
2022-09-01 02:29:41 +02:00
	}
2022-11-08 20:08:29 +01:00
	push(@tokens, $self->next_token()) if $tokens[-1]->[0] ne "\n" && defined($token = $self->peek()) && $token->[0] eq "\n";
2022-09-01 02:29:41 +02:00
	return @tokens;
}

sub accumulate {
	my ($self, $tokens, $cmd) = @_;
	push(@$tokens, @$cmd);
}

sub parse {
	my ($self, $stop) = @_;
	push(@{$self->{stop}}, $stop);
	goto DONE if $self->stop_at($self->peek());
	my @tokens;
	while (my @cmd = $self->parse_cmd()) {
		$self->accumulate(\@tokens, \@cmd);
		last if $self->stop_at($self->peek());
	}
DONE:
	pop(@{$self->{stop}});
	return @tokens;
}

2022-09-01 02:29:42 +02:00
# TestParser is a subclass of ShellParser which, beyond parsing shell script
# code, is also imbued with semantic knowledge of test construction, and checks
# tests for common problems (such as broken &&-chains) which might hide bugs in
# the tests themselves or in behaviors being exercised by the tests. As such,
# TestParser is only called upon to parse test bodies, not the top-level
# scripts in which the tests are defined.
package TestParser;

use base 'ShellParser';

chainlint: annotate original test definition rather than token stream
When chainlint detects problems in a test, such as a broken &&-chain, it
prints out the test with "?!FOO?!" annotations inserted at each problem
location. However, rather than annotating the original test definition,
it instead dumps out a parsed token representation of the test. Since it
lacks comments, indentations, here-doc bodies, and so forth, this
tokenized representation can be difficult for the test author to digest
and relate back to the original test definition.
However, now that each parsed token carries positional information, the
location of a detected problem can be pinpointed precisely in the
original test definition. Therefore, take advantage of this information
to annotate the test definition itself rather than annotating the parsed
token stream, thus making it easier for a test author to relate a
problem back to the source.
Maintaining the positional meta-information associated with each
detected problem requires a slight change in how the problems are
managed internally. In particular, shell syntax such as:
msg="total: $(cd data; wc -w *.txt) words"
requires the lexical analyzer to recursively invoke the parser in order
to detect problems within the $(...) expression inside the double-quoted
string. In this case, the recursive parse context will detect the broken
&&-chain between the `cd` and `wc` commands, returning the token stream:
cd data ; ?!AMP?! wc -w *.txt
However, the parent parse context will see everything inside the
double-quotes as a single string token:
"total: $(cd data ; ?!AMP?! wc -w *.txt) words"
losing whatever positional information was attached to the ";" token
where the problem was detected.
One way to preserve the positional information of a detected problem in
a recursive parse context within a string would be to attach the
positional information to the annotation textually; for instance:
"total: $(cd data ; ?!AMP:21:22?! wc -w *.txt) words"
and then extract the positional information when annotating the original
test definition.
However, a cleaner and much simpler approach is to maintain the list of
detected problems separately rather than embedding the problems as
annotations directly in the parsed token stream. Not only does this
ensure that positional information within recursive parse contexts is
not lost, but it keeps the token stream free from non-token pollution,
which may simplify implementation of validations added in the future
since they won't have to handle non-token "?!FOO?!" items specially.
Finally, the chainlint self-test "expect" files need a few mechanical
adjustments now that the original test definitions are emitted rather
than the parsed token stream. In particular, the following items missing
from the historic parsed-token output are now preserved verbatim:
* indentation (and whitespace, in general)
* comments
* here-doc bodies
* here-doc tag quoting (e.g. "\EOF")
* line-splices (i.e. "\" at the end of a line)
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
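The idea of keeping detected problems in a separate list and then annotating the original text can be sketched as below. This is a simplified illustration under stated assumptions, not the script's actual annotation code: the sample body, the hand-computed position of ";", and the back-to-front insertion order are all choices made for the example.

```perl
use strict;
use warnings;

# Sample test body containing a broken &&-chain inside $(...).
my $body = 'msg="total: $(cd data; wc -w *.txt) words"';

# Each problem is recorded as [label, token], where the token is a
# [lexeme, start, end] tuple; the token stream itself is left untouched.
my $semi = index($body, ';');
my @problems = (['AMP', [';', $semi, $semi + 1]]);

# Annotate a copy of the original text, working from the last problem to
# the first so that earlier offsets stay valid after each insertion.
my $annotated = $body;
for my $problem (sort {$b->[1]->[2] <=> $a->[1]->[2]} @problems) {
	my ($label, $token) = @$problem;
	substr($annotated, $token->[2], 0, " ?!$label?!");
}
print "$annotated\n";
```

Because the annotation is applied to the original text at a recorded offset, it does not matter that the parent parse context saw the whole double-quoted string as a single token.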
2022-11-08 20:08:30 +01:00
sub new {
	my $class = shift @_;
	my $self = $class->SUPER::new(@_);
	$self->{problems} = [];
	return $self;
}

2022-09-01 02:29:42 +02:00
sub find_non_nl {
	my $tokens = shift @_;
	my $n = shift @_;
	$n = $#$tokens if !defined($n);
2022-11-08 20:08:29 +01:00
	$n-- while $n >= 0 && $$tokens[$n]->[0] eq "\n";
2022-09-01 02:29:42 +02:00
	return $n;
}

sub ends_with {
	my ($tokens, $needles) = @_;
	my $n = find_non_nl($tokens);
	for my $needle (reverse(@$needles)) {
		return undef if $n < 0;
		$n = find_non_nl($tokens, $n), next if $needle eq "\n";
2022-11-08 20:08:29 +01:00
		return undef if $$tokens[$n]->[0] !~ $needle;
2022-09-01 02:29:42 +02:00
		$n--;
	}
	return 1;
}

2022-09-01 02:29:45 +02:00
sub match_ending {
	my ($tokens, $endings) = @_;
	for my $needles (@$endings) {
		next if @$tokens < scalar(grep {$_ ne "\n"} @$needles);
		return 1 if ends_with($tokens, $needles);
	}
	return undef;
}
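
How find_non_nl() and ends_with() behave on tuple-style tokens can be seen in a small standalone sketch. The two helpers are copied from the listing so that the example runs on its own; the token stream and the patterns passed in are hypothetical, and positions are left undefined because only lexemes are matched.

```perl
use strict;
use warnings;

# Copies of the helpers from the listing, so the sketch is self-contained.
sub find_non_nl {
	my $tokens = shift @_;
	my $n = shift @_;
	$n = $#$tokens if !defined($n);
	$n-- while $n >= 0 && $$tokens[$n]->[0] eq "\n";
	return $n;
}

sub ends_with {
	my ($tokens, $needles) = @_;
	my $n = find_non_nl($tokens);
	for my $needle (reverse(@$needles)) {
		return undef if $n < 0;
		$n = find_non_nl($tokens, $n), next if $needle eq "\n";
		return undef if $$tokens[$n]->[0] !~ $needle;
		$n--;
	}
	return 1;
}

# Hypothetical token stream for "cmd || return 1" plus a trailing newline.
my @tokens = map {[$_, undef, undef]} ('cmd', '||', 'return', '1', "\n");

# Trailing newline tokens are skipped before the needles are matched
# right-to-left against the lexemes.
my $safe = ends_with(\@tokens, [qr/^(?:exit|return)$/, qr/^(?:\d+|\$\?)$/]);
my $pipe = ends_with(\@tokens, [qr/^\|$/]);
print 'ends with return/exit: ', ($safe ? 'yes' : 'no'), "\n";
print 'ends with pipe: ', ($pipe ? 'yes' : 'no'), "\n";
```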

2022-09-01 02:29:50 +02:00
sub parse_loop_body {
	my $self = shift @_;
	my @tokens = $self->SUPER::parse_loop_body(@_);
	# did loop signal failure via "|| return" or "|| exit"?
2022-11-08 20:08:29 +01:00
	return @tokens if !@tokens || grep {$_->[0] =~ /^(?:return|exit|\$\?)$/} @tokens;
2022-09-01 02:29:51 +02:00
	# did loop upstream of a pipe signal failure via "|| echo 'impossible
	# text'" as the final command in the loop body?
	return @tokens if ends_with(\@tokens, [qr/^\|\|$/, "\n", qr/^echo$/, qr/^.+$/]);
2022-09-01 02:29:50 +02:00
	# flag missing "return/exit" handling explicit failure in loop body
	my $n = find_non_nl(\@tokens);
2022-11-08 20:08:30 +01:00
	push(@{$self->{problems}}, ['LOOP', $tokens[$n]]);
2022-09-01 02:29:50 +02:00
	return @tokens;
}

2022-09-01 02:29:45 +02:00
my @safe_endings = (
2022-09-01 02:29:47 +02:00
	[qr/^(?:&&|\|\||\||&)$/],
2022-09-01 02:29:45 +02:00
	[qr/^(?:exit|return)$/, qr/^(?:\d+|\$\?)$/],
	[qr/^(?:exit|return)$/, qr/^(?:\d+|\$\?)$/, qr/^;$/],
	[qr/^(?:exit|return|continue)$/],
	[qr/^(?:exit|return|continue)$/, qr/^;$/]);

2022-09-01 02:29:42 +02:00
sub accumulate {
	my ($self, $tokens, $cmd) = @_;
2022-11-08 20:08:30 +01:00
|
|
|
my $problems = $self->{problems};
|
2022-11-08 20:08:27 +01:00
|
|
|
|
|
|
|
# no previous command to check for missing "&&"
|
2022-09-01 02:29:42 +02:00
|
|
|
goto DONE unless @$tokens;
|
2022-11-08 20:08:27 +01:00
|
|
|
|
|
|
|
# new command is empty line; can't yet check if previous is missing "&&"
|
chainlint: latch start/end position of each token
When chainlint detects problems in a test, such as a broken &&-chain, it
prints out the test with "?!FOO?!" annotations inserted at each problem
location. However, rather than annotating the original test definition,
it instead dumps out a parsed token representation of the test. Since it
lacks comments, indentations, here-doc bodies, and so forth, this
tokenized representation can be difficult for the test author to digest
and relate back to the original test definition.
To address this shortcoming, an upcoming change will make it print out
an annotated copy of the original test definition rather than the
tokenized representation. In order to do so, it will need to know the
start and end positions of each token in the original test definition.
As preparation, upgrade TestParser::scan_token() to latch the start and
end position of the token being scanned, and return that information
along with the token itself. A subsequent change will take advantage of
this positional information.
In terms of implementation, TestParser::scan_token() is retrofitted to
return a tuple consisting of the token's lexeme and its start and end
positions, rather than returning just the lexeme. However, an
alternative would be to define a class which represents a token:
    package Token;
    sub new {
        my ($class, $lexeme, $start, $end) = @_;
        bless [$lexeme, $start, $end] => $class;
    }
    sub as_string {
        my $self = shift @_;
        return $self->[0];
    }
    sub compare {
        my ($x, $y) = @_;
        if (UNIVERSAL::isa($y, 'Token')) {
            return $x->[0] cmp $y->[0];
        }
        return $x->[0] cmp $y;
    }
    use overload (
        '""' => 'as_string',
        'cmp' => 'compare'
    );
The major benefit of the class-based approach is that it is entirely
non-invasive; it requires no additional changes to the rest of the
script since a Token converts automatically to a string, which is what
scan_token() historically returned.
The big downside to the Token approach, however, is that it is _slow_;
on this developer's (old) machine, it increases user-time by an
unacceptable seven seconds when scanning all test scripts in the
project. Hence, the simple tuple approach is employed instead since it
adds only a fraction of a second user-time.
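For comparison, the tuple approach might look like the sketch below.
It is an illustration only: next_word() is a hypothetical miniature
scanner, not the real TestParser::scan_token(), and it splits on
whitespace rather than following shell lexical rules.

```perl
use strict;
use warnings;

# Hypothetical miniature scanner: returns the next whitespace-delimited
# word as a plain [lexeme, start, end] array reference, or nothing at
# end of input.
sub next_word {
    my ($buf) = @_;                  # reference to the input string
    $$buf =~ /\G\s+/gc;              # skip leading whitespace
    my $start = pos($$buf) // 0;     # latch start position
    return unless $$buf =~ /\G(\S+)/gc;
    return [$1, $start, pos($$buf)]; # latch end position
}

my $input = 'cd data; wc -w';
while (my $t = next_word(\$input)) {
    printf "%-5s %2d..%2d\n", @$t;
}
```

The cost of this approach is that callers which previously treated a
token as a string must now dereference the lexeme explicitly, as in
$token->[0].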
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2022-11-08 20:08:29 +01:00
|
|
|
goto DONE if @$cmd == 1 && $$cmd[0]->[0] eq "\n";
|
2022-09-01 02:29:42 +02:00
|
|
|
|
2022-09-01 02:29:45 +02:00
	# did previous command end with "&&", "|", "|| return" or similar?
	goto DONE if match_ending($tokens, \@safe_endings);
2022-09-01 02:29:42 +02:00
|
|
|
|
2022-09-01 02:29:48 +02:00
	# if this command handles "$?" specially, then okay for previous
	# command to be missing "&&"
	for my $token (@$cmd) {
2022-11-08 20:08:29 +01:00
|
|
|
goto DONE if $token->[0] =~ /\$\?/;
|
2022-09-01 02:29:48 +02:00
|
|
|
}
|
|
|
|
|
2022-09-01 02:29:49 +02:00
	# if this command is "false", "return 1", or "exit 1" (which signal
	# failure explicitly), then okay for all preceding commands to be
	# missing "&&"
2022-11-08 20:08:29 +01:00
|
|
|
if ($$cmd[0]->[0] =~ /^(?:false|return|exit)$/) {
|
2022-11-08 20:08:30 +01:00
|
|
|
@$problems = grep {$_->[0] ne 'AMP'} @$problems;
|
2022-09-01 02:29:49 +02:00
		goto DONE;
	}
2022-09-01 02:29:42 +02:00
	# flag missing "&&" at end of previous command
	my $n = find_non_nl($tokens);
2022-11-08 20:08:30 +01:00
|
|
|
push(@$problems, ['AMP', $tokens->[$n]]) unless $n < 0;
|
2022-09-01 02:29:42 +02:00
DONE:
	$self->SUPER::accumulate($tokens, $cmd);
}
2022-09-01 02:29:43 +02:00
# ScriptParser is a subclass of ShellParser which identifies individual test
# definitions within test scripts, and passes each test body through TestParser
# to identify possible problems. ShellParser detects test definitions not only
# at the top-level of test scripts but also within compound commands such as
# loops and function definitions.
2022-09-01 02:29:39 +02:00
|
|
|
package ScriptParser;
|
|
|
|
|
2022-09-01 02:29:43 +02:00
|
|
|
use base 'ShellParser';
|
|
|
|
|
2022-09-01 02:29:39 +02:00
sub new {
	my $class = shift @_;
2022-09-01 02:29:43 +02:00
|
|
|
my $self = $class->SUPER::new(@_);
|
2022-09-01 02:29:39 +02:00
	$self->{ntests} = 0;
	return $self;
}
2022-09-01 02:29:43 +02:00
# extract the raw content of a token, which may be a single string or a
# composition of multiple strings and non-string character runs; for instance,
# `"test body"` unwraps to `test body`; `word"a b"42'c d'` to `worda b42c d`
sub unwrap {
2022-11-08 20:08:29 +01:00
|
|
|
my $token = (@_ ? shift @_ : $_)->[0];
|
2022-09-01 02:29:43 +02:00
	# simple case: 'sqstring' or "dqstring"
	return $token if $token =~ s/^'([^']*)'$/$1/;
	return $token if $token =~ s/^"([^"]*)"$/$1/;

	# composite case
	my ($s, $q, $escaped);
	while (1) {
		# slurp up non-special characters
		$s .= $1 if $token =~ /\G([^\\'"]*)/gc;
		# handle special characters
		last unless $token =~ /\G(.)/sgc;
		my $c = $1;
		$q = undef, next if defined($q) && $c eq $q;
		$q = $c, next if !defined($q) && $c =~ /^['"]$/;
		if ($c eq '\\') {
			last unless $token =~ /\G(.)/sgc;
			$c = $1;
			$s .= '\\' if $c eq "\n"; # preserve line splice
		}
		$s .= $c;
	}
	return $s
}

sub check_test {
	my $self = shift @_;
	my ($title, $body) = map(unwrap, @_);
	$self->{ntests}++;
	my $parser = TestParser->new(\$body);
	my @tokens = $parser->parse();
2022-11-08 20:08:30 +01:00
	my $problems = $parser->{problems};
	return unless $emit_all || @$problems;
2022-09-13 06:01:47 +02:00
|
|
|
my $c = main::fd_colors(1);
|
2022-11-08 20:08:30 +01:00
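A minimal sketch of the "separate problem list" approach described above, assuming each problem is recorded as a [label, token] pair and each token is a [lexeme, start, end] tuple. The helper name `annotate` and the sample input are hypothetical and are not part of chainlint.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: insert "?!LABEL?!" markers into the original text at
# the end offset of each offending token, leaving the text otherwise intact.
sub annotate {
	my ($body, $problems) = @_;
	my ($start, $out) = (0, '');
	# Visit problems in order of position so the text is copied left to right.
	for (sort { $a->[1]->[2] <=> $b->[1]->[2] } @$problems) {
		my ($label, $token) = @$_;
		my $pos = $token->[2];    # end offset of the offending token
		$out .= substr($body, $start, $pos - $start) . " ?!$label?! ";
		$start = $pos;
	}
	$out .= substr($body, $start);
	# Drop the padding space wherever the marker already abuts whitespace.
	$out =~ s/(\s) \?!/$1?!/mg;
	$out =~ s/\?! (\s)/?!$1/mg;
	return $out;
}

my $body = "cd data; wc -w *.txt";
# The ";" token occupies offsets 7..8 of $body.
my @problems = (['AMP', [';', 7, 8]]);
print annotate($body, \@problems), "\n";    # prints: cd data; ?!AMP?! wc -w *.txt
```

Because the problems live in their own list, the token stream never contains marker text, and the original spacing and comments survive in the output.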
	my $start = 0;
	my $checked = '';
	for (sort {$a->[1]->[2] <=> $b->[1]->[2]} @$problems) {
		my ($label, $token) = @$_;
		my $pos = $token->[2];
		$checked .= substr($body, $start, $pos - $start) . " ?!$label?! ";
		$start = $pos;
	}
	$checked .= substr($body, $start);
2022-09-01 02:29:43 +02:00
	$checked =~ s/^\n//;
2022-11-08 20:08:30 +01:00
	$checked =~ s/(\s) \?!/$1?!/mg;
	$checked =~ s/\?! (\s)/?!$1/mg;
2022-09-13 06:01:47 +02:00
	$checked =~ s/(\?![^?]+\?!)/$c->{rev}$c->{red}$1$c->{reset}/mg;
2022-09-01 02:29:43 +02:00
	$checked .= "\n" unless $checked =~ /\n$/;
2022-09-13 06:01:47 +02:00
	push(@{$self->{output}}, "$c->{blue}# chainlint: $title$c->{reset}\n$checked");
2022-09-01 02:29:43 +02:00
}

2022-09-01 02:29:39 +02:00
sub parse_cmd {
2022-09-01 02:29:43 +02:00
	my $self = shift @_;
	my @tokens = $self->SUPER::parse_cmd();
chainlint: latch start/end position of each token
When chainlint detects problems in a test, such as a broken &&-chain, it
prints out the test with "?!FOO?!" annotations inserted at each problem
location. However, rather than annotating the original test definition,
it instead dumps out a parsed token representation of the test. Since it
lacks comments, indentations, here-doc bodies, and so forth, this
tokenized representation can be difficult for the test author to digest
and relate back to the original test definition.
To address this shortcoming, an upcoming change will make it print out
an annotated copy of the original test definition rather than the
tokenized representation. In order to do so, it will need to know the
start and end positions of each token in the original test definition.
As preparation, upgrade TestParser::scan_token() to latch the start and
end position of the token being scanned, and return that information
along with the token itself. A subsequent change will take advantage of
this positional information.
In terms of implementation, TestParser::scan_token() is retrofitted to
return a tuple consisting of the token's lexeme and its start and end
positions, rather than returning just the lexeme. However, an
alternative would be to define a class which represents a token:
package Token;
sub new {
my ($class, $lexeme, $start, $end) = @_;
bless [$lexeme, $start, $end] => $class;
}
sub as_string {
my $self = shift @_;
return $self->[0];
}
sub compare {
my ($x, $y) = @_;
if (UNIVERSAL::isa($y, 'Token')) {
return $x->[0] cmp $y->[0];
}
return $x->[0] cmp $y;
}
use overload (
'""' => 'as_string',
'cmp' => 'compare'
);
The major benefit of the class-based approach is that it is entirely
non-invasive; it requires no additional changes to the rest of the
script since a Token converts automatically to a string, which is what
scan_token() historically returned.
The big downside to the Token approach, however, is that it is _slow_;
on this developer's (old) machine, it increases user-time by an
unacceptable seven seconds when scanning all test scripts in the
project. Hence, the simple tuple approach is employed instead since it
adds only a fraction of a second user-time.
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2022-11-08 20:08:29 +01:00
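A rough sketch of the tuple representation chosen above, assuming a greatly simplified scanner that only splits on whitespace. The function name `scan_tokens` is hypothetical; the real TestParser::scan_token() handles shell quoting, here-docs, and much more.

```perl
use strict;
use warnings;

# Hypothetical, simplified scanner: return every whitespace-delimited word as
# a [lexeme, start, end] tuple, where "end" is the offset just past the word.
sub scan_tokens {
	my ($text) = @_;
	my @tokens;
	while ($text =~ /(\S+)/g) {
		# $-[1] and $+[1] hold the start and end offsets of capture group 1.
		push(@tokens, [$1, $-[1], $+[1]]);
	}
	return @tokens;
}

my @tokens = scan_tokens("cd data && wc -w");
printf("%-4s %d..%d\n", @$_) for @tokens;
```

A caller that previously compared the returned string now has to look at element 0 of the tuple, for example `$tokens[2]->[0] eq '&&'`, which is the invasive part of the change that the Token class would have avoided.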
	return @tokens unless @tokens && $tokens[0]->[0] =~ /^test_expect_(?:success|failure)$/;
2022-09-01 02:29:43 +02:00
	my $n = $#tokens;
2022-11-08 20:08:29 +01:00
	$n-- while $n >= 0 && $tokens[$n]->[0] =~ /^(?:[;&\n|]|&&|\|\|)$/;
2022-09-01 02:29:43 +02:00
	$self->check_test($tokens[1], $tokens[2]) if $n == 2; # title body
	$self->check_test($tokens[2], $tokens[3]) if $n > 2;  # prereq title body
	return @tokens;
2022-09-01 02:29:39 +02:00
}
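As an illustration only, the way parse_cmd() picks the title and body arguments can be tried in isolation. The sketch below assumes tokens are [lexeme, start, end] tuples; the function name `title_and_body` and the dummy positions are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the argument-picking logic.
sub title_and_body {
	my @tokens = @_;
	return unless @tokens && $tokens[0]->[0] =~ /^test_expect_(?:success|failure)$/;
	my $n = $#tokens;
	# Ignore trailing command separators such as ";" or a newline.
	$n-- while $n >= 0 && $tokens[$n]->[0] =~ /^(?:[;&\n|]|&&|\|\|)$/;
	return @tokens[1, 2] if $n == 2;    # title body
	return @tokens[2, 3] if $n > 2;     # prereq title body
	return;
}

# Positions are irrelevant for this demonstration, so they are all zero.
my @cmd = map { [$_, 0, 0] } ('test_expect_success', 'PERL', 'title', 'body', "\n");
my ($title, $body) = title_and_body(@cmd);
print "$title->[0] / $body->[0]\n";    # prints: title / body
```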

# main contains high-level functionality for processing command-line switches,
# feeding input test scripts to ScriptParser, and reporting results.
package main;

my $getnow = sub { return time(); };
my $interval = sub { return time() - shift; };
if (eval {require Time::HiRes; Time::HiRes->import(); 1;}) {
	$getnow = sub { return [Time::HiRes::gettimeofday()]; };
	$interval = sub { return Time::HiRes::tv_interval(shift); };
}

2022-09-13 06:01:47 +02:00
# Restore TERM if test framework set it to "dumb" so 'tput' will work; do this
# outside of get_colors() since under 'ithreads' all threads use %ENV of main
# thread and ignore %ENV changes in subthreads.
$ENV{TERM} = $ENV{USER_TERM} if $ENV{USER_TERM};

my @NOCOLORS = (bold => '', rev => '', reset => '', blue => '', green => '', red => '');
my %COLORS = ();
sub get_colors {
	return \%COLORS if %COLORS;
	if (exists($ENV{NO_COLOR}) ||
	    system("tput sgr0 >/dev/null 2>&1") != 0 ||
	    system("tput bold >/dev/null 2>&1") != 0 ||
	    system("tput rev >/dev/null 2>&1") != 0 ||
	    system("tput setaf 1 >/dev/null 2>&1") != 0) {
		%COLORS = @NOCOLORS;
		return \%COLORS;
	}
	%COLORS = (bold => `tput bold`,
		   rev => `tput rev`,
		   reset => `tput sgr0`,
		   blue => `tput setaf 4`,
		   green => `tput setaf 2`,
		   red => `tput setaf 1`);
	chomp(%COLORS);
	return \%COLORS;
}

my %FD_COLORS = ();
sub fd_colors {
	my $fd = shift;
	return $FD_COLORS{$fd} if exists($FD_COLORS{$fd});
	$FD_COLORS{$fd} = -t $fd ? get_colors() : {@NOCOLORS};
	return $FD_COLORS{$fd};
}

2022-09-01 02:29:44 +02:00
sub ncores {
	# Windows
	return $ENV{NUMBER_OF_PROCESSORS} if exists($ENV{NUMBER_OF_PROCESSORS});
	# Linux / MSYS2 / Cygwin / WSL
	do { local @ARGV='/proc/cpuinfo'; return scalar(grep(/^processor\s*:/, <>)); } if -r '/proc/cpuinfo';
	# macOS & BSD
	return qx/sysctl -n hw.ncpu/ if $^O =~ /(?:^darwin$|bsd)/;
	return 1;
}

2022-09-01 02:29:39 +02:00
sub show_stats {
	my ($start_time, $stats) = @_;
	my $walltime = $interval->($start_time);
	my ($usertime) = times();
	my ($total_workers, $total_scripts, $total_tests, $total_errs) = (0, 0, 0, 0);
2022-09-13 06:01:47 +02:00
	my $c = fd_colors(2);
	print(STDERR $c->{green});
2022-09-01 02:29:39 +02:00
	for (@$stats) {
		my ($worker, $nscripts, $ntests, $nerrs) = @$_;
		print(STDERR "worker $worker: $nscripts scripts, $ntests tests, $nerrs errors\n");
		$total_workers++;
		$total_scripts += $nscripts;
		$total_tests += $ntests;
		$total_errs += $nerrs;
	}
2022-09-13 06:01:47 +02:00
	printf(STDERR "total: %d workers, %d scripts, %d tests, %d errors, %.2fs/%.2fs (wall/user)$c->{reset}\n", $total_workers, $total_scripts, $total_tests, $total_errs, $walltime, $usertime);
2022-09-01 02:29:39 +02:00
}

sub check_script {
	my ($id, $next_script, $emit) = @_;
	my ($nscripts, $ntests, $nerrs) = (0, 0, 0);
	while (my $path = $next_script->()) {
		$nscripts++;
		my $fh;
		unless (open($fh, "<", $path)) {
			$emit->("?!ERR?! $path: $!\n");
			next;
		}
		my $s = do { local $/; <$fh> };
		close($fh);
		my $parser = ScriptParser->new(\$s);
		1 while $parser->parse_cmd();
		if (@{$parser->{output}}) {
2022-09-13 06:01:47 +02:00
			my $c = fd_colors(1);
2022-09-01 02:29:39 +02:00
			my $s = join('', @{$parser->{output}});
2022-09-13 06:01:47 +02:00
			$emit->("$c->{bold}$c->{blue}# chainlint: $path$c->{reset}\n" . $s);
2022-09-01 02:29:39 +02:00
			$nerrs += () = $s =~ /\?![^?]+\?!/g;
		}
		$ntests += $parser->{ntests};
	}
	return [$id, $nscripts, $ntests, $nerrs];
}

sub exit_code {
	my $stats = shift @_;
	for (@$stats) {
		my ($worker, $nscripts, $ntests, $nerrs) = @$_;
		return 1 if $nerrs;
	}
	return 0;
}

Getopt::Long::Configure(qw{bundling});
GetOptions(
	"emit-all!" => \$emit_all,
2022-09-01 02:29:44 +02:00
	"jobs|j=i" => \$jobs,
2022-09-01 02:29:39 +02:00
|
|
|
"stats|show-stats!" => \$show_stats) or die("option error\n");
|
2022-09-01 02:29:44 +02:00
|
|
|
$jobs = ncores() if $jobs < 1;
|
t: add skeleton chainlint.pl
Although chainlint.sed usefully identifies broken &&-chains in tests, it
has several shortcomings which include:
* only detects &&-chain breakage in subshells (one-level deep)
* does not check for broken top-level &&-chains; that task is left to
the "magic exit code 117" checker built into test-lib.sh, however,
that detection does not extend to `{...}` blocks, `$(...)`
expressions, or compound statements such as `if...fi`,
`while...done`, `case...esac`
* uses heuristics, which makes it (potentially) fallible and difficult
to tweak to handle additional real-world cases
* written in `sed` and employs advanced `sed` operators which are
probably not well-known to many programmers, thus the pool of people
who can maintain it is likely small
* manually simulates recursion into subshells which makes it much more
difficult to reason about than, say, a traditional top-down parser
* checks each test as the test is run, which can get expensive for
tests which are run repeatedly by functions or loops since their
bodies will be checked over and over (tens or hundreds of times)
unnecessarily
To address these shortcomings, begin implementing a more functional and
precise test linter which understands shell syntax and semantics rather
than employing heuristics, thus is able to recognize structural problems
with tests beyond broken &&-chains.
The new linter is written in Perl, thus should be more accessible to a
wider audience, and is structured as a traditional top-down parser which
makes it much easier to reason about, and allows it to inspect compound
statements within test bodies to any depth.
Furthermore, it can check all test definitions in the entire project in
a single invocation rather than having to be invoked once per test, and
each test definition is checked only once no matter how many times the
test is actually run.
At this stage, the new linter is just a skeleton containing boilerplate
which handles command-line options, collects and reports statistics, and
feeds its arguments -- paths of test scripts -- to a (presently)
do-nothing script parser for validation. Subsequent changes will flesh
out the functionality.
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-01 02:29:39 +02:00
|
|
|
|
|
|
|
my $start_time = $getnow->();
my @stats;

my @scripts;
push(@scripts, File::Glob::bsd_glob($_)) for (@ARGV);
unless (@scripts) {
	show_stats($start_time, \@stats) if $show_stats;
	exit;
}

unless ($Config{useithreads} && eval {
	require threads; threads->import();
	require Thread::Queue; Thread::Queue->import();
	1;
	}) {
	push(@stats, check_script(1, sub { shift(@scripts); }, sub { print(@_); }));
	show_stats($start_time, \@stats) if $show_stats;
	exit(exit_code(\@stats));
}

my $script_queue = Thread::Queue->new();
my $output_queue = Thread::Queue->new();

sub next_script { return $script_queue->dequeue(); }
sub emit { $output_queue->enqueue(@_); }

sub monitor {
	while (my $s = $output_queue->dequeue()) {
		print($s);
	}
}

my $mon = threads->create({'context' => 'void'}, \&monitor);
threads->create({'context' => 'list'}, \&check_script, $_, \&next_script, \&emit) for 1..$jobs;

$script_queue->enqueue(@scripts);
$script_queue->end();

for (threads->list()) {
	push(@stats, $_->join()) unless $_ == $mon;
}

$output_queue->end();
$mon->join();

show_stats($start_time, \@stats) if $show_stats;
exit(exit_code(\@stats));