Saturday, December 05, 2009
Troubles Outside of ASCII
Unicode presents the language designer with thousands of temptations. But forcing programmers to use characters outside of the ASCII range causes them significant grief. Here are some reasons why:
- Fonts used by programmers are generally excellent for reading ASCII, but once the number of glyphs grows large, disambiguating between characters becomes troublesome. For example, the visual differences between 1 and l, and O and 0 have been well worked out. The difference between ‹ and < has not, in many cases. Even in fixed-width fonts with excellent Unicode support, the difference between some glyphs is so slight (and justifiably so!) that the use of such glyphs is a font-dependent decision. Rosebud does not specify any particular font.
- Keyboards and input programs used by programmers are generally excellent for entering ASCII, but past ASCII are often optimized for a particular language or locale. Rosebud should be very useful to a programmer who expresses themselves naturally outside of ASCII, but ASCII remains the closest thing to a 'lowest common denominator' we can design for.
- "Avoid non-ASCII" forms a clear, measured constraint which keeps the overall language design from becoming too complex.
Rosebud does not stop you from using Unicode outside of the ASCII range. In fact there is no explicit limitation to ASCII in any part of Rosebud -- it is all UTF-8. However, Rosebud avoids including characters outside of the ASCII range in its normal syntax. The exceptions (so far) are INFINITY (U+221E) and WARNING SIGN (U+26A0), which represent the (somewhat esoteric) floating point literals for infinity and not-a-number, and REPLACEMENT CHARACTER (U+FFFD), which is generated in the case of UTF-8 encoding errors. Programmers may give these literal values other names without penalty, so a programmer who must frequently enter the floating point value negative infinity as a literal needs no non-ASCII input method for U+221E -- they enter it in the program once and give it an ASCII name. Using non-ASCII characters in the syntax, as opposed to a 'keyword' such as "infinity" or "NaN", makes the parsing rules far simpler.
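To make the role of REPLACEMENT CHARACTER concrete, here is a small sketch in Python (not Rosebud, whose decoder is not shown in this post). It only illustrates the general convention of substituting U+FFFD for bytes that are not valid UTF-8; the sample bytes and the name `decode_source` are made up for the illustration.

```python
def decode_source(raw):
    """Decode bytes as UTF-8, substituting U+FFFD for invalid sequences.

    Hypothetical helper: it mirrors the convention described above,
    not Rosebud's actual implementation.
    """
    return raw.decode("utf-8", errors="replace")


# 0xE9 is a lone Latin-1 byte here, so it is not valid UTF-8.
# The final three bytes are the UTF-8 encoding of U+221E INFINITY.
raw = b"caf\xe9 \xe2\x88\x9e"
text = decode_source(raw)

print(text.count("\ufffd"))   # number of replacement characters produced
print("\u221e" in text)       # the valid INFINITY sequence survives intact
```

The invalid byte becomes a single U+FFFD while the well-formed sequence for U+221E decodes normally, so an encoding error stays visible in the decoded text instead of silently disappearing.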
In fact, Rosebud has no keywords at all, which means the temptation is only to use Unicode outside of the ASCII range for 'punctuation' used to express program structure. There are places in the design where new kinds of left and right matching punctuation characters would be very useful to indicate a new kind of structure. But, O, that way madness lies; let me shun that; no more of that.
Saturday, November 28, 2009
More Encoding and Lexical Analysis of Rosebud
This blog post extends my earlier remarks on the encoding/lexical design of Rosebud.
As a language grows, the deeper levels tend to stay the same. For example, the recent language growth to Java 7 will bring syntactic changes, but only one of them reaches down to the level of lexical analysis (large number literals with underscores, as in 1_000_000). The risk of ruining too much old code is too great to attempt too many such changes. The design of the lexical system of a programming language must be careful and general. If it limits growth, you can't easily fix it later.
Rosebud programs are expressed as sequences of bytes that are normally considered as UTF-8 encodings of Unicode code points. Other encodings were considered, but there was no clear reason to use anything other than UTF-8. Without lots of real data it is impossible to say that another encoding is more compact. Only UTF-8 sidesteps byte ordering issues. It is possible in Rosebud to express literal strings of arbitrary bytes, and also to escape the normal parser and lexer entirely. Therefore other encodings can be present within Rosebud source files. Normally, however, it's UTF-8 all the way down.
This blog post gives a representative reaction to the issues associated with ignoring root language issues.
Rosebud has several base syntaxes and can be extended to have arbitrarily more without eviscerating the compiler. It is also possible to use a different lexer. However, extending the lexer should generally be considered an extreme and undesirable tactic, as it nearly always causes too much confusion for human readers. It is therefore important that the design of the lexer be strong enough to serve as a good foundation for as many programs as possible.
There are several things about natural language that influence the design.
(Here must come my usual disclaimer regarding the "appeal to usual orthographies" arguments I make. Contemporary Linguists don't much study orthography. This is because nearly all humans have a "spoken" or "expressed" language, but only a percentage are literate. Reading and writing are not part of our innate language abilities. Computer programs, on the other hand, must exist in some concrete and serial form in order to become input for the computer. The closest thing we have to that on the human side are writing systems. Lacking a research budget, I pose guesses about things that are true of "most" or "usual" human writing systems without a lot of rigor. If you have data to share, please do. But, if you instead wish to attack my arguments by pointing at the relative weakness of my premises, please don't. I'm doing the best I can in my circumstances and I'm already aware of the difficulties.)
Punctuation exists to provide hints to the reader that can only otherwise be conveyed through sonic means. Consider a sign in a barber's window:
what do you think I will shave you for nothing
Only with punctuation (and perhaps also typography) can we begin to disambiguate this sentence as written.
"What, do you think I will shave you for nothing?"
"What do you think? I will shave you for nothing!"
Presumably the meaning of these words when spoken would be determined through context and/or inflection, but when written we must use context and/or punctuation.
Similarly, in programming languages, punctuation exists to provide structural disambiguation. It is unreasonable to expect the computer to know what you mean in ambiguous cases as you might expect of humans. You can create languages (for example Polish Notation arithmetic) in which no ambiguity exists, at least no ambiguity that needs to be resolved through special marks, but this tends to be more difficult for humans to use. (Write something non-trivial in PostScript if you require further justification for that statement.)
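As a rough illustration of the Polish Notation remark, here is a minimal Python sketch of a prefix-arithmetic evaluator. It assumes every operator takes exactly two operands and that tokens are separated by white space; the names `OPS` and `eval_prefix` are invented for this example.

```python
import operator

# Assumption: four binary operators, each taking exactly two operands.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_prefix(source):
    """Evaluate white-space separated prefix (Polish Notation) arithmetic."""
    tokens = iter(source.split())

    def expr():
        token = next(tokens)
        if token in OPS:
            # The operator comes first, so grouping is never ambiguous.
            left = expr()
            right = expr()
            return OPS[token](left, right)
        return float(token)

    value = expr()
    if next(tokens, None) is not None:
        raise ValueError("unexpected tokens after a complete expression")
    return value


print(eval_prefix("* + 1 2 3"))   # what infix would write as (1 + 2) * 3
print(eval_prefix("+ 1 * 2 3"))   # what infix would write as 1 + (2 * 3)
```

No parentheses or precedence rules are needed because the position of each operator fixes the grouping, which is also why longer expressions become hard for a human to scan.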
In contemporary programming languages we are always using words, punctuation, and a third class of things. This third class of things amounts to making ASCII pictograms of the various glyphs of the representational system in common use by contemporary mathematicians (which incidentally is itself rife with ambiguities). The crux of the design is this: You can still use Unicode pictograms in Rosebud, but not at the lexical level. Lexical chunks of the program exist either as words or indications of structure, but there is no third class that behaves a little like both.
For example, a program fragment such as
    2 + 2
when read aloud, sounds like "two plus two". On the other hand, the program fragment
    while [not close-enough n] {
        set n [tweak n];
    }
reads aloud as "While not close-enough en, set en (to) tweak en." The brackets, hyphen and semicolon are structural, and not read aloud. A programmer speaking the program aloud might use inflection to indicate "close-enough" was one word, and the hyphen serves the same job when combining words in English and most of its orthographies.
Because + behaves as a word, one can not write a+2 and expect the lexer to understand + lexically as punctuation. This is both good and bad. Programmers might get frustrated by errors caused by the relatively small, visually unclear omissions of white space. On the other hand, a robust compiler can provide excellent diagnoses of such problems. As well, what possible meaning could a name like a+2 have? Names like + with both the quality of being a name and the quality of breaking up other names as if it were punctuation are often called operators in other languages. Operator lexing requires that we either have a very dynamic lexer, or that we identify all operators up front. We above identified mid-parse lexical rules changes as something to be avoided. The number of potential operators is large, and has different desirable traits in different programs. In the final analysis, treating all name lexemes uniformly makes lexical rules simple, uniform, easy to learn, and yields lexemes that will work naturally with a number of parsers.
Rosebud takes the traditional approach of program analysis in that it strongly separates lexical and syntactical analysis. It need not be this way. For example, if we did not want : to act as punctuation, but we did want to use it to indicate names in particular namespaces (for example namespace:namespace:name), we could leave it as a non-punctuation character, but provide a special secondary lexical analysis step in the case of words generated that contained a colon.
As of this blog post, Rosebud is still being designed. The following codepoints are definitely considered space or punctuation by Rosebud, but there may be more:
| 0009 | | CHARACTER TABULATION |
| 000A | | LINE FEED |
| 0020 | | SPACE |
| FFFD | | REPLACEMENT CHARACTER |
| 0022 | " | QUOTATION MARK |
| 0023 | # | NUMBER SIGN |
| 0027 | ' | APOSTROPHE |
| 0028 | ( | LEFT PARENTHESIS |
| 0029 | ) | RIGHT PARENTHESIS |
| 002C | , | COMMA |
| 003A | : | COLON |
| 003B | ; | SEMICOLON |
| 0040 | @ | COMMERCIAL AT |
| 005B | [ | LEFT SQUARE BRACKET |
| 005D | ] | RIGHT SQUARE BRACKET |
| 0060 | ` | GRAVE ACCENT |
| 007B | { | LEFT CURLY BRACKET |
| 007D | } | RIGHT CURLY BRACKET |
Edit: U+003C LESS-THAN SIGN and U+003E GREATER-THAN SIGN were included as punctuation characters erroneously. Also, blogger.com's html digester is not a happy thing.
[ EDIT: eliminated misleading statement of opinion. ]
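The rules in this post can be sketched as a toy lexer. This is Python, not Rosebud's real lexer, and it makes several assumptions: the three white space code points from the table only separate lexemes, every other listed code point (including U+FFFD) is emitted as a one-character lexeme, and no special handling exists for strings or comments. The names `SPACE`, `PUNCTUATION` and `lex` are invented here.

```python
# Code points taken from the table above; the split into "space" and
# "punctuation" for U+FFFD is an assumption of this sketch.
SPACE = {"\u0009", "\u000A", "\u0020"}
PUNCTUATION = set("\"#'(),:;@[]`{}") | {"\uFFFD"}


def lex(source):
    """Split source into words and single-character punctuation lexemes."""
    lexemes = []
    word = []
    for ch in source:
        if ch in SPACE or ch in PUNCTUATION:
            if word:
                # A separator ends the word collected so far.
                lexemes.append("".join(word))
                word = []
            if ch in PUNCTUATION:
                lexemes.append(ch)
        else:
            # Everything else, including + and -, is part of a word.
            word.append(ch)
    if word:
        lexemes.append("".join(word))
    return lexemes


print(lex("while [not close-enough n] {\n    set n [tweak n];\n}"))
print(lex("a+2"))
print(lex("a + 2"))
```

Because the hyphen and + are not in the table, close-enough and a+2 each come out as a single word, while a + 2 yields three lexemes.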
Labels: rosebud-language
Sunday, November 22, 2009
Literate Programming
Today I finished reading most of Literate Programming [Knuth, 84]. It's hard to say anything bad about Knuth, but I'm not okay with this book. Maybe it's just that so much has changed since 1984.
There's a weird "Gosh, I kinda like go to" essay in here. You can kind of see why, and incidentally it paints Knuth as a system programmer at heart, but for the purposes of writing a book on literate programming, it's entirely out of place.
It seems to me that the primary advantages of Knuth's literate programming are:
1. The ability to produce relatively well typeset program listings.
2. The ability to write programs out of the order they need to be in for the compiler to make sense of them, but in a more appropriate order for human readers.
3. Extension of the programming language (Pascal or C) with WEB macros.
4. Indexing.
We have 1. by other means now, although all web browser output is hideous in comparison to that of any reasonable typesetting software in the hands of someone who knows what they're doing. 2. and 3. are the results of having a good macro system. Common Lisp and Scheme have far stronger systems than WEB. 4. seems like some kind of weird OCD symptom.
The rest of the advantages of literate programming as presented in the book are otherwise just good advice for how to document and write comments in your code. We need more of that. But this ain't it.
Labels: lisp, literate programming, rosebud-language
Sunday, November 15, 2009
Lexical Analysis
Lexical analysis turns a stream of input characters into identifiable chunks, or "lexemes". For example, the input stream "3" should turn into the lexeme for the number three. Decisions about the syntax of a language must be made at the lexical level to disambiguate input streams such as 10E-6. One language may intend for such a stream to mean a single number, namely ten times ten to the power of negative six. Another may instead intend two numbers and a symbol: ten, E, and negative six.
The main decision in designing a lexical system is usually about how to divide lexemes into names, numeric literals, special punctuation, and comments. If the characters that can go in each of these categories do not overlap, separation between lexical elements can be more free in programs. If there are few restrictions on what goes in a name or numeric literal, then disambiguating them from their surroundings requires more care.
For example, imagine language A has the lexical rules; "names may consist only of letters" and "numbers may contain only decimal digits". More complicated numeric literals presumably are constructed using other language features. With these two lexical rules, we can give special lexical significance to single characters that can represent common arithmetic operations. a3 generates two lexemes, a and 3. The program fragment a+3 is read as three lexemes. Programmers are free to use white space as they choose. Transitions between lexemes are accounted for entirely by the contents of the lexemes themselves, unless two numbers or two names occur in sequence.
Now imagine a different language B, which has the lexical rules; "names may consist of any characters other than white space" and "numbers are names which contain only decimal digits". Again, this glosses over more complex numeric literals. Under language B's lexical rules, the program fragment a+3 is a single lexeme (as is a3). To generate three lexemes, the programmer must explicitly separate the lexical elements using white space; a + 3. However, in this case + need have no special significance. It can be a name like any other.
A third general approach to lexical analysis could be characterized by a language C, which has more complicated lexical rules that describe when lexemes begin and when they end. Language C's rules are: a name begins with something that is not a digit, and ends when a space or special character is encountered. A number begins with something that is not a letter, and ends when a space or a special character is encountered. Such a lexical system interprets a+3 as three lexemes, but a3 as one.
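The three simplified languages can be compared with a few lines of Python. This is only a sketch of the rules as stated above: it assumes ASCII letters and digits, treats + - * / as the "special characters" (the post does not list them), and only finds lexeme boundaries without classifying names and numbers. The function names are invented.

```python
import re


def lex_a(source):
    # Language A: names are runs of letters, numbers are runs of digits,
    # and any other non-space character is a lexeme by itself.
    return re.findall(r"[A-Za-z]+|[0-9]+|[^\sA-Za-z0-9]", source)


def lex_b(source):
    # Language B: a lexeme is any run of non-white-space characters.
    return source.split()


def lex_c(source):
    # Language C: a lexeme runs until white space or a special character;
    # each special character (assumed to be + - * /) is its own lexeme.
    return re.findall(r"[^\s+\-*/]+|[+\-*/]", source)


for fragment in ("a3", "a+3", "a + 3", "10E-6"):
    print(fragment, lex_a(fragment), lex_b(fragment), lex_c(fragment))
```

Running it shows the differences described above: a3 is two lexemes in A but one in B and C, and a+3 is three lexemes in A and C but a single lexeme in B.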
Languages A, B, and C as presented above are greatly simplified and gloss over a large number of important issues. But judging these rudimentary forms, it would seem that languages A and C have the preferable quality of doing the right thing if a programmer omits spaces, language C more so than language A. However, the more important question is, which of language A or B conforms more to how the human mind processes language? In my opinion, language B comes closer.
In terms of orthography, we know that words are things human readers take in as units separated by spaces. The well known trick of changing the order of letters in words save for the first and last letter, and observing that such words are still somehow comprehensible, demonstrates this;
"Tihs is an eaxpmle of scuh a senetcne."
Compare your ability to comprehend the above sentence to the one in which letters are all in the correct order, but spaces are omitted;
"Thisisanexampleofsuchasentence."
The eye of a reader can go faster (and correct more errors) if lexemes that produce a sound when read aloud are separated by space. Demanding such separation in a programming language therefore makes it conform to the human eye. Most programming languages which allow a great deal of freedom in the omission of spaces between lexemes also have earnest style guides which encourage programmers to use spaces to separate the lexemes anyway.
In terms of natural language, we chunk lexical elements from either language A or B into several utterable chunks, as in "ey plus three", or "ten times ten to the power of negative six". The amount of compression into the shorthand 10E-6 is similar in A, B, and C, whether we demand that it be written 10 E -6 or not. Lexical issues in programming languages must focus more on making an unambiguous orthography than on, say, morphological rules in natural languages.
The exception to the natural use of spaces to separate lexemes comes when we want to form compound names. Many natural languages allow putting multiple nouns together into a single word both in spoken and written form. Sometimes the programmer wants to name something with a single programming language lexeme that contains multiple names, for example, the fifth element of an array named R might be R[5]. It seems unnatural to write R [5], since we are naming one thing, just with compound terms.
Rosebud defines several characters which can not go in names, including white space characters. Lexemes are separated by these characters. These include characters that are used to form compound names, and characters that indicate program structure. Except for white space, these single characters are all themselves lexemes. By convention, as in written English, a hyphen is used to construct names that are compound in meaning to the programmer, but singular in meaning to the author.
Whether UTF-8 or UTF-16 would be a more compact encoding for Rosebud programs is impossible to say. UTF-8 eliminates byte ordering issues. So, Rosebud interprets its input as a stream of bytes in UTF-8. Rosebud has enough syntactic extension facilities so that if large amounts of data that would require three bytes in UTF-8 but only two in UTF-16 needed to live in program files, a UTF-16 encoding could be used for it.
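The size and byte-order trade-off can be checked directly. The following Python sketch compares encoded sizes for an ASCII word and for U+221E INFINITY, and prints the two possible UTF-16 byte orders; the helper name `encoded_sizes` is made up for this illustration.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count without a byte order mark)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


print(encoded_sizes("while"))    # ASCII text: smaller in UTF-8
print(encoded_sizes("\u221e"))   # U+221E: three bytes in UTF-8, two in UTF-16

# UTF-16 has two serializations of the same text, depending on byte order.
print("\u221e".encode("utf-16-le").hex(), "\u221e".encode("utf-16-be").hex())

# UTF-8 has only one serialization, so there is no byte order to agree on.
print("\u221e".encode("utf-8").hex())
```

Which encoding is more compact depends entirely on the mix of characters in the file, which is the reason the question cannot be settled without real data.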
Labels: rosebud-language
Tagging rosebud-language
In systems which use tagging or other named classifications, please tag Rosebud related things as "rosebud-language". This will make it easier for others to search for related material.
Labels: rosebud-language
Monday, November 09, 2009
Coming Out of the Programming Language Designer Closet
I'm Joe Miklojcik, a programming language designer. I currently work as a system administrator for Shopwiki, Inc., and also do some private consulting under my own shingle, Nectarine City LLC. Most programming language designers work as system administrators, I have found. I liken the situation to how most Actors work as waiters -- there are far more people who want to work at Acting than there are roles to fill. The Actors' unions regularly report that less than 16% of their membership works at acting jobs at any given time during a year. Those who do work generally only work at acting for an average of 17 weeks per year. Even if they had employment acting for 50 weeks per year, their overall income would be unsustainable. I don't mind system administration, and it's far more profitable than working as a waiter. It's just not my favorite thing to do.
The commercial need for new programming languages is far too slim to support many designers. Therefore, programming language designers often go into academia. I'm 40 years old, I need to make a certain amount of money every year to support my family and responsibly service my retirement, and I have a bachelor's degree in computer science (with honors, even). If you think you'd like me in your graduate program with a deal that gainfully employs me as well, I'm all ears. Barring that amazing occurrence, my options for entry into academia are slim.
The good news is that programming language designers don't need a lot of resources to design programming languages; just time, some textbooks, a laptop, and -- if one wishes to publish -- an Internet connection. Learning about programming language design makes you a better programmer, and a good systems person too. You can say things like "UNIX is: 'You are in a twisty little maze of 10,000 awful programming languages, all alike.'" and mean it, with great authority. You don't have to set your sights on designing a language that will be loved and have petalines of code written in it. You're free to design the best languages you can without a marketing department insisting that you make it "more like the javas".
I am currently working on the designs for two programming languages. The first, Phosphorus, is meant to satirize the Lisp community's constant wrestling with the question of why their language isn't more popular, given how incredibly awful today's popular languages are in comparison. There will probably never be an implementation of Phosphorus. The second, Rosebud, is an attempt to make a language that is strongly influenced by what contemporary Linguistics has to say about the human mind's innate language ability, while retaining other more traditional traits that make a useful programming language. I hope to one day construct at least a toy interpreter for Rosebud.
I have two "papers" on Phosphorus:
http://www.nectarine-city.com/phosphorus.pdf
http://www.nectarine-city.com/phosphoCon.pdf
and I have one paper on Rosebud:
http://www.nectarine-city.com/rosebud-desiderata.pdf
Thanks for reading.
Labels: rosebud-language
Monday, November 02, 2009
CALL FOR PAPERS -- phosphoConf 2010
We are very excited to announce the callForPapers for the upcoming phosphoConf 2010.
http://www.nectarine-city.com/phosphoCon.pdf
Thanks especially to our waxyWhite level sponsor, Nectarine City LLC.
Edit: for more information on Phosphorus, see our initial paper.
Labels: phosphorus
Sunday, October 18, 2009
Nectarine City Handbook of C Programming Style
My lulu store has the latest edition of my C programming style guide.
http://stores.lulu.com/store.php?fAcctID=1009550
There are more rules and examples than in the last "Second 2008" edition, although the character and spirit of the guide remain essentially the same. The biggest changes are that I now advocate using capital L and U in numeric literals (as distinguishing l from 1 can be difficult) and that I hacked out most of the horror of LaTeX's default chapter formatting. You too can print something that looks sensible now with my new lulu.cls file.
http://nectarine-city.com/lulu.cls
As ever, the guide is licensed under a CC BY-SA 3.0 license, the PDF is free to download from lulu.com, and lulu.com printed versions net me and lulu.com a few bucks each.
There will probably not be a 2010 edition, but instead I intend to expand the guide to include style rules for English, and perhaps also Common Lisp.
Labels: c, c style, nectarine city, programming, style

