Index Interval Format Issue

lua-users home
wiki

Here is discussed the combined issue of 1-based indexing and closed interval slicing versus 0-based indexing and right-open interval slicing. Other solutions are, in my opinion, not consistent.

[Note from the initial author: Though this text first expresses my point of view on this topic, as a wiki page it is intended to be further elaborated (also polished, as I'm not a native English speaker!). Yet, I have thought and talked about this issue for a while, especially because I've been involved in training; so you may consider thinking, too, before firing. Remove this note after editing. --DeniSpir (denis dot spir at free dot fr), 8 Oct 2008.]

Origin Of The War?

The source of the whole point lies in the history of programming. An implementation detail of some languages such as C is that array items are referenced through a pointer + offset address. Thus, the offset exists at the implementation level, which leads to a choice for the language designer to either transfer that scheme into the language itself, so that the programmer has to adapt to it, or let the compiler or interpreter subtract one from all indexes so that the (often human) programmer can use ordinals instead. This choice was not the one for C, and as a consequence for the whole C/C++/Java/Python/Ruby family. As C has been highly successful, language designers who want their 'baby' be widely used have to take this de facto standard into account -- in addition to the fact that they also are C programmers. Now, stating that a feature is "natural" because it is a standard simply is a logical error.

The use of zero-indexing in C is a choice that many (probably most) programmers agree with to this day, and you cannot just assume that it is only because of familiarity or convention that they continue to prefer it (unless of course you take your conclusion as a premise). Your last sentence, and terms like "baby", unfairly tar all opponents of your viewpoint with the same brush. —MarceloCantos?

Campaigns

Humans Are Humans

Arrays, lists, sequences, and tables are ordered sets. Humans use ordinal numbers--first, second, third (or #1, #2, #3)--for pointing an element in a set (which one?). In contrast, they use cardinal numbers--1, 2, 3--to count the number of elements (how many?). Unfortunately enough, programming is massively English speaking, and this language calls both number types "numbers" (compare with Nummer/Zahl in German or numéro/nombre in French). Still all languages make a big difference between the two. In all fields of human life, except for a sub-field of programming, ordinals are used for pointing things. Look for year 0, the 0th book on a shelve, or Baker Street, #0 ;-). Moreover, ordinals seem to have appeared first, which is the reason why "zero", a cardinal, is so young in human history. Languages witness this fact: people say "nobody", "never", and "nothing" instead of using the word "zero". All of this is to say that at the human level, the programmer-friendly choice is to use ordinals, not offsets, to reference items in an ordered set.

Humans also use 0 for the first minute of the hour. Also consider the stupidities foisted on us by one-based counting. The new millenium started on 1/1/2001. a[(i - 1)%n + 1] . The first hour of the day is 12 and the last hour is 11 (it seems they had the presence of mind to make 1 o'clock start one hour after midnight, but then got all tied up in knots when it came to midnight itself). Lists of ten always have one two-digit number; this thoroughly confuses children learning to count and read to 10 and confuses them even more when they have to learn about zero. The first kilometer of a trip starts at zero kilometers. —MarceloCantos?

Your argument regarding the terms "nobody", "never" and "nothing" is a non sequiter. These are all cardinal terms, not ordinal. It is perhaps odd that such terms are used to describe sets of cardinality zero, but this has little to do with the question of whether ordinals should start at zero. —MarceloCantos?

Similarly, closed intervals happen to be more natural or intuitive. In everyday life, "from the first to the third" has an inclusive meaning in any case. Using half-open intervals thus first demands a mental work which eventually becomes automatic. This neither means that such a syntax is less efficient, nor less rational, but simply less human-friendly for non-initiates. Some programmers seem to be proud of such esoteric features that lead newbies to errors ;-).

Again, the use of emotive terms such as "proud" and "esoteric" is an ad hominem rhetorical device that tries to sway the reader by means other than rational argumentation. —MarceloCantos?

Some programmers find half-open intervals to be a much saner system that produces more reliable code than closed intervals, and that reliability trumps conformity to a millenia-old mistake for the sake of what non-initiates find "intuitive". —MarceloCantos?

Here it is necessary to distinguish process and human semantic levels. On the process level, both schemes are equally functional, and both semantics are consistent. On the human level, the 0-base/half-open scheme needs a kind of 'transcoding' from the process layer. As a consequence, if one agrees that source code (as well as mathematical expressions for instance) is to be read by humans first, rather than by machines, then this scheme is a lesser option -- one should choose it only if the alternative proved to be impossible for any reason. Note that Pascal, while as old as C, has taken the opposite way -- not a coincidence, as it has been designed to be suitable for education.

Programmers, Too

In everyday work, once one is used to one or the other convention, both schemes prove to be usable. When programmers discuss about that issue, they tend to bitterly hold to their personal comfort, that is their mental habits built through everyday use. As mostly used languages follow the C convention, there are many arguments supporting it. Now, when carefully watched, these words happen to be wrong as logical reasons: behind rational expression, they are opinions.

Both schemes may be usable, but half-open ranges are substantively simpler, more rational and less prone to off-by-one mistakes. —MarceloCantos?

For instance, the argument that a 0-length range is best expressed with [n,n) is more or less meaningless; the only sensible way to express an empty sequence is []. Both [5,5) and [5,4] simply are semantically absurd. Now, it may happen at run time that a sequence slice is empty -- this is another level: the programmer does not cope with it at design time.

First off, a minor quibble: the argument may have more or less merit, but it makes no sense to say that it is meaningless. —MarceloCantos?

Using [] to express the empty range is awkward. A computer program will typically use a data structure with two end-points: struct range { int start, end; };. The code to intersect two such ranges is very simple: range intersect(range a, range b) { range r = { max(a.start, b.start); min(a.end, b.end); if (r.end < r.start) r.end = r.start; return r; }, as is the cardinality: int count(range r) { return r.end - r.start; }, and the empty function: bool empty(range r) { return r.start == r.end; }. As soon as you require a different representation for empty ranges, everything suddenly gets more complicated and consumes more RAM and CPU. You'll also find that you have to liberally sprinkle your code with + 1 and - 1 to correct for various off-by-one issues (e.g., the count now requires r.end - r.start + 1 as well as explicitly testing for the empty state). —MarceloCantos?

Closed ranges are extremely awkward for continuous quantities. How do you express the range of timestamps corresponding to a single day? Half-open ranges do this easily: [2010-01-01, 2010-01-02). Closed ranges would need to do this: [2010-01-01, 2010-01-01 23:59:59], which has at least two problems: 1) it necessarily assumes a quantum; 2) it is an unnatural and complex way to express the simple concept of a day. Half-open ranges are also trivial to splice into contiguous non-intersecting sub-ranges: [e, pi) can be spliced into [e, 3) and [3, pi). In the general case, this is impossible with closed ranges, and even with a quantum, it is not at all obvious that the pair [2010-01-01, 2010-01-01 11:59:59] and [2010-01-01 12:00:00, 2010-01-01 23:59:59] is contiguous (is the quantum one second?). While many of these issues are more tractable for discrete applications such as array indexing, it remains the case that half-open ranges are easier to work with. As a simple example, it is blatantly obvious that [a, b) and [b, c) are contiguous, whereas [a, b] and [c, d] require further inspection of the code. Also, it is easier to represent an n-way splice of a half-open range, [a, b), [b, c), ..., [y, z) as an array, [a, b, c, ..., z], whereas with an n-way splice of a closed range, you have be careful to remember whether you are storing the start of each splice or the end. —MarceloCantos?

Consider also the fact that the "length" of a half-open range is end - start, regardless of whether the range is continuous or discrete, whereas the formula for the "length" of a closed range is different for continuous and discrete ranges. As a practical and quite common example, the number of days in a half-open date range is computed the same way regardless of whether the end-points are dates or timestamps. This is why I tend to prefer date1 <= d AND d < date2 over SQL's BETWEEN operator; I never have to think about whether the variables in question are dates or timestamps, or worry about whether a maintainer will ever switch the types from one to the other. —MarceloCantos?

[5,5) is a degenerate case, and such cases are often very useful. What, exactly, makes it absurd? —MarceloCantos?

Light(s) For Peace

In the 21st century, should such losses of time and energy still happen? There are at least two paths to get out of this problem.

Explicit Syntax

As some note, BASIC allows explicit expression of the array index scheme. For instance, array(0,10) would define a zero-based range of indexes. In a similar point of view, one could explicitly use half-open intervals using either of the common mathematical syntaxes for that: [a,b) or [a,b[ . This option has the additional advantage of getting rid of error-prone format, as the scheme used is explicitly written. All of this is good for the human reader, yet the problem remains to properly interpret expressions written in a convention one is not used to. The following proposal addresses this issue.

Editor Customization Layer

As programmers all of us are familiar with nice editor customization features, like indent options that allow us to say if we prefer tabs or spaces or how wide an indent should be. Note that this works both for loading/reading and editing/saving source code files. Whatever the author's options, we are able to read and edit the code with our own preferred convention. Whatever our options, another developer will be able to read and edit the same code using his own choices. Now, the file may be saved according to any standard norm, no matter. This is a kind of foreground/background distinction performed through an editor customization layer. Great!

Now, why not extend this principle to *any* language feature? For instance, one may prefer to get rid of the '=' for assignments (which is semantically wrong), replace it with e.g. ':', and use '=' only for logical equality instead. When loading a file, the editor should display the code using these preferences, whatever the standard used for saving.

The same is worth for indexes and slices: a C programmer would set 0-base/half-open slices and get the code displayed that way, whatever the background saving standard may be. A programmer used to Delphi would choose the opposite scheme, no matter.

[What do you think of that? I mainly program in Python these days, and most of the IDEs for this language are written in Python/wxPython, too. So that when I have time for it, I aim to implement this feature into an editor...]

"There be dragons." How would an editor convert a[len - 1 - i] to 1-based indexing? Would it simply wrap the index thus: a[(len - 1 - i) + 1], or would it try to parse the content, looking for a "- 1" to remove or a "+/- <constant>" to alter, thus: a[len - i]? What if the code were unfinished: a[len - ] or simply perverse: a[~i & 0xff] /* len == 256 */? And how would you deal with C++, which lets you overload operator []? Also consider that programmers frequently share a screen while discussing code, which would lead to horrendous confusion. —MarceloCantos?

See Also


RecentChanges · preferences
edit · history
Last edited January 16, 2010 10:47 pm GMT (diff)