Tricks to catch mistakes in code and text – from Zabbix
What do the following open-source projects have in common:
glibc
gcc
llvm-project
mysql-server
postgresql
timescaleDB
libxml2
httpd
openssl
?
- They are very successful and well-known.
- Zabbix (open-source monitoring of IT infrastructure/applications) relies on them.
- They all have “the the” typos in their codebase…
gcc
./libcpp/expr.cc
/* Return the string type corresponding to the the input user-defined string
llvm-project
./flang/include/flang/Parser/parse-tree.h
// with order of the the requirement productions in the grammar.
glibc
./resolv/res_send.c
/* If the current buffer is not the the static
mysql-server
./storage/innobase/include/buf0buf.h
/* If the the version is OK, then the space must not be deleted.
postgresql
./src/backend/executor/execExprInterp.c
* JSON_EXISTS_OP to the target type. If the the target type is integer
timescaleDB
./tsl/src/nodes/vector_agg/exec.c
/* Get a reference the the output TupleTableSlot */
libxml2
./tree.c
* if we are the the (at least) 3rd level of
httpd
./docs/manual/mod/mod_log_config.xml
<li> Time taken for the the useragent to read and process the
openssl
./test/bio_base64_test.c
* Only the the first four variants make sense with padding or truncated
One of the main advantages of the open-source model is that people outside your project can comment on improvements. You want these people to comment on functional and architectural errors. Not typos and silly mistakes.
Luckily, you don’t need complicated or expensive software to fix most of them in your project.
This particular issue can be found with the grep tool, which should be available out of the box on every Linux or Unix-like platform: grep -rI " the the " .
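The literal pattern above requires a space on both sides, so it skips a doubled word at the very start or end of a line, or one that begins with a capital letter. A minimal sketch of a slightly more forgiving search follows. The function name find_doubled_word is hypothetical, and the -r, -I and -w options are assumed to be available, as they are in GNU and BSD grep.

```shell
# Hypothetical helper: search a directory tree for one doubled word, e.g. "the".
# Usage: find_doubled_word the .
find_doubled_word() {
    word="$1"
    dir="${2:-.}"
    # -r recurse into directories, -I skip binary files,
    # -n print line numbers, -i ignore case,
    # -w only accept matches that start and end on word boundaries.
    grep -rIniw -- "$word $word" "$dir"
}
```

Because of -w, a line such as "bathe the dog" is not reported even though it contains the characters "the the".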
OK, let’s say we fixed it, but what now?
What about “ a a ”?
Most of the projects I mentioned contain this typo as well. For example:
openssl:
./include/openssl/ssl.h.in
/* Maximum length of the application-controlled segment of a a TLSv1.3 cookie */
OK, what do we do then? Grep for all words one by one?
Of course not, we can just use egrep (a variant of the grep tool, equivalent to grep -E):
find . -type f -name '*.*' -exec egrep --with-filename '(\b[a-zA-Z]+)\s+\1\b' {} \;
And this will show us all repeated words in the directory.
Many of them will be false positives, like the “long long” type in a variable declaration:
openssl
./crypto/ec/ecx_meth.c:
unsigned long long buff[512];
(which we can filter out by appending | awk '!/long long/' to the command)
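To decide which false positives are worth filtering, it can help to see which doubled words dominate the results. The sketch below tallies them, most frequent first. It is an illustration, not part of the original command: the function name count_repeated_words is hypothetical, and it assumes GNU grep, since \b, \s and back-references in extended regular expressions are not guaranteed by POSIX.

```shell
# Hypothetical helper: count how often each doubled word occurs under a directory.
# Usage: count_repeated_words .
count_repeated_words() {
    dir="${1:-.}"
    # -o prints only the matched text and -h drops the file name, so every
    # output line is one doubled word such as "long long" or "the the".
    # sort + uniq -c counts identical lines; sort -rn puts the largest count first.
    grep -rIohE '\b([a-zA-Z]+)\s+\1\b' "$dir" |
        sort |
        uniq -c |
        sort -rn
}
```

A pair that tops the list and is valid syntax, like “long long” in C code, is a natural candidate for the filter.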
But the search will also turn up plenty of genuine mistakes, like:
“is is”
openssl
./crypto/perlasm/x86_64-xlate.pl:
# ; this is is the text section/segment
“of of”
gcc
./libgo/go/cmd/go/internal/modload/buildlist.go:
// keep is a set of of modules that provide packages or are needed to
“that that”
httpd
./modules/md/md_reg.h:
* indicates that that renewal is not configured (see renew_mode).
and many others.
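The repeated-word search and the false-positive filter described above can be combined into one small function. This is a sketch under assumptions rather than a definitive tool: the name find_repeated_words and its optional second argument are hypothetical, it uses grep -rE in place of find with egrep, and it relies on GNU grep for \b, \s and back-references.

```shell
# Hypothetical helper: list repeated words under a directory,
# dropping lines that match a regex of known false positives.
# Usage: find_repeated_words . 'long long|had had'
find_repeated_words() {
    dir="${1:-.}"
    ignore="${2:-long long}"   # default filter mirrors the awk example above
    # -r recurse, -I skip binary files, -n print line numbers, -E extended regex.
    # awk then prints only the lines that do NOT match the ignore regex.
    grep -rInE '\b([a-zA-Z]+)\s+\1\b' "$dir" |
        awk -v ignore="$ignore" '$0 !~ ignore'
}
```

Note that the filter is applied to the whole output line, including the file path, so a line containing both a false positive and a real typo is dropped as well.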
I believe coding is an art. Can you imagine the greatest works of literature in history containing “and and”, “to to” and “in in”?
Repeated words make code harder to read. They also leave the impression that the code was not properly reviewed. But there are plenty of other error types that we can also detect.
In my talk, I would like to go through a list of 1) simple, 2) quick and 3) free-of-charge techniques and tools to catch errors in your open-source project and make it better.
The Featured Blog Posts series highlights posts from partners and members of the All Things Open community leading up to ATO 2024.