The Programming Language DINO: Predeclared identifiers

9. Predeclared identifiers

Dino has quite a lot of predeclared identifiers. They are combined in a few singleton objects, also called spaces (see the section Declarations and Scope Rules). Most predeclared identifiers refer to functions. A predeclared function expects a given number of actual parameters (possibly a variable number). If the number of actual parameters is unexpected, the exception parnumber is generated. A predeclared function also expects the actual parameters (possibly after implicit conversions) to be of the required types. If they are not, the exception partype is generated. To show how many parameters a function requires, the function descriptions list the parameter names and enclose optional parameters in the brackets [ and ].

Example: the following description

          strtime ([format [, time]])

says that the function accepts zero, one, or two parameters. If only one parameter is given, it is the parameter format.
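
For illustration, the three call forms allowed by this description, plus one call with too many parameters, might look as follows. This is only a sketch: strtime belongs to the space sys (described later in this chapter), and passing 0 as the time value assumes that the parameter is a number of seconds counted from a fixed moment, which this section does not state.

          expose sys.*;

          putln (strtime ());               // no parameters
          putln (strtime ("%H:%M:%S"));     // format only
          putln (strtime ("%H:%M:%S", 0));  // format and time (0 is an assumed value)
          try {
            strtime ("%H:%M:%S", 0, 0);     // one parameter too many
          } catch (parnumber) {
            putln ("unexpected number of parameters");
          }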

If nothing is said about the returned result, the function return value is undefined.

The predeclared identifiers are described below according to their spaces.

9.1 Space lang

The space contains fundamental Dino declarations. All declarations of the space are always exposed.

Predeclared variables

Space lang has some predeclared variables which contain useful information or can be used to control the behaviour of the Dino interpreter.

Arguments and environment

To access arguments to the program and the environment, the following variables can be used:

Versions

As Dino is a live programming language, both the language and its interpreter are still under development. The final variables version and lang_version give the Dino interpreter's version number and the language version, respectively. The values are floating point numbers. For example, if the current Dino interpreter version is 0.97 and the Dino language version is 0.5, the variable values will be 0.97 and 0.5.
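
A script that relies on a particular language feature could check these variables before going further. The following is only a sketch; the threshold 0.5 is an arbitrary illustrative value, not a documented requirement.

          // 0.5 is a hypothetical minimum language version.
          if (lang_version < 0.5) {
            putln ("language version ", lang_version, " is too old");
            exit (1);
          }
          putln ("interpreter ", version, ", language ", lang_version);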

Threads

To access some information about threads in a Dino program, the following variables can be used.

All these variables are final, so you can not change their values.

Exception classes

All predeclared classes in the space lang describe exceptions which may be generated in a Dino program. All Dino exceptions are represented by objects of the predeclared class except or of one of its sub-classes. The class except has no parameters. It has only one predeclared direct sub-class, error. Classes for user-defined exceptions should preferably be declared as sub-classes of except. All other exceptions (e.g. those generated by the Dino interpreter itself or by predeclared functions) are objects of the class error or of predeclared sub-classes of error. The class error and all its sub-classes have one parameter msg which contains a readable message about the exception. The following classes are declared in the space lang as sub-classes of error:
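
As a minimal sketch of the recommendation about user-defined exceptions, such an exception can be declared, thrown, and caught as shown below. The class name stack_empty is hypothetical, and the use clause is modelled on the one in the class our_token from the yaep example at the end of this chapter.

          // stack_empty is a hypothetical user-defined exception class.
          class stack_empty () { use except; }

          try {
            throw stack_empty ();
          } catch (stack_empty) {
            putln ("caught stack_empty");
          }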

Functions of the space lang

The following functions are declared in the space lang:

9.2 Space io

The space contains functions for input and output and for working with files and directories. All declarations of the space are always exposed.

Exception classes of the space io

The following classes are declared in the space io as sub-classes of invcall:

Class file

Dino has a predeclared final class file. Work with files in a Dino program is done through objects of this class. All declarations inside the class are private. Objects of the class can be created only by the predeclared functions open or popen. If you try to create an object by calling the class itself, the exception callop will be generated. The file encoding is defined by the current Dino encoding at file creation time (see the functions set_encoding and set_file_encoding). If you want to work with files on the byte level without any encoding/decoding, you can use the encoding called "RAW".

Files

To output something into the standard output streams or to input something from the standard input stream, the following variables can be used:

All these variables are final, so you cannot change their values. The encoding of these files is the current Dino encoding at program start (see the function set_encoding).

Functions for work with files

The following functions (besides the input/output functions) work with OS files. The functions may generate an exception declared in the class syserror (e.g. eaccess, enametoolong, eisdir and so on) besides the standard partype and parnumber. The function rename can be used for renaming a directory, not only a file.

File output functions

The following functions are used to output something into opened files. All the function return values are undefined. The functions may generate an exception declared in the class syserror (e.g. eio, enospc and so on) besides the standard partype and parnumber.

File input functions

The following functions are used to input something from opened files. The functions may generate an exception declared in the class syserror (e.g. eio, enospc and so on) or eof, besides the standard partype and parnumber.

Encoding functions

Dino internally uses Unicode for characters. To communicate with the rest of the world, it can use different encodings. The default encoding is UTF-8. Dino has two functions to get and change the current encoding:

Examples:

          putln (get_encoding ());
          set_encoding ("KOI8-R");

Functions for work with directories

The following functions work with directories. The functions may generate an exception declared in the class syserror (e.g. eaccess, enametoolong, enotdir and so on) besides the standard partype and parnumber.

Functions for access to file/directory information

The following predeclared functions can be used for accessing file or directory information. The functions may generate an exception declared in the class syserror (e.g. eaccess, enametoolong, enfile and so on) besides the standard partype and parnumber. The functions expect one parameter which should be a file instance (see the predeclared class file) or the path name of a file represented by a string (the functions make an implicit string conversion of the parameter value). The single exception to this is isatty which expects a file instance.

The following functions can be used to change the access rights of a file (directory) for different users. Each function expects two strings (after an implicit string conversion). The first one is the path name of the file (directory). The second one gives the rights. For instance, if the string contains the character 'r', it denotes the right to read (see the characters used to denote different rights in the description of the function fumode). The function return values are always undefined.

Miscellaneous functions

There are the following miscellaneous functions in space io:

9.3 Space sys

This space contains declarations to work with the underlying execution environment (OS) and related exceptions.

Exceptions in space sys

The space contains a lot of exceptions:

Variable time_format

The variable value is a string which is the output format of time used by the function strtime when it is called without parameters. The initial value of the variable is the string "%a %b %d %H:%M:%S %Z %Y".
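
A program can assign a new format to the variable to change what a parameterless strtime call returns. The sketch below uses qualified names of the space sys; the new format string is only an example built from the same conversion specifiers as the initial value.

          putln (sys.strtime ());        // uses the initial format
          sys.time_format = "%H:%M:%S %d %b %Y";
          putln (sys.strtime ());        // uses the new format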

Time functions

The following functions from the space sys can be used to get information about real time.

Functions for access to information about OS processes

The space sys contains predeclared functions which are used to get information about the current OS process (the Dino interpreter which executes the program). Each OS process has a unique identifier. An OS process is usually started by a concrete user and group and is executed on behalf of a concrete user and group (the so-called effective identifiers). The following functions return such information. On some OSes the functions may return the string "Unknown" as a name if there are no notions of user and group identifiers.

Function system (command)

The function executes the command given by a string (the parameter value) in the OS command interpreter. Besides the standard exceptions parnumber and partype the function may generate the exceptions noshell and systemfail.
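
A call to system can be guarded against both exceptions as sketched below. The command string is only an example, and the sketch assumes that exposing the declarations of the space sys makes noshell and systemfail visible under these names.

          expose sys.*;

          try {
            system ("echo hello from the shell");  // example command
          } catch (noshell) {
            putln ("no OS command interpreter is available");
          } catch (systemfail) {
            putln ("the command could not be executed");
          }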

9.4 Space re

This space contains declarations which can be useful for working with regular expressions and for pattern matching -- see also the match-statements.

Exception class invregex

This class describes exceptions specific to executing the pmatch-statement and to calling the predeclared functions which implement regular expression pattern matching. Although there is only one class for this, the message in the class parameter differs from case to case and explains the details.

Variable split_regex

The variable value is a string which represents a regular expression which is used by the predeclared function split when the second parameter is not given. The initial value of the variable is the string "[ \t]+".
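
The effect of the default can be sketched as follows. The sketch assumes that split takes the string as its first parameter and the optional regex as its second, and that it returns a vector of the resulting substrings; the full signature is not given in this section, so treat these points as assumptions.

          expose re.*;

          var i, words = split ("one two\tthree");  // split_regex is used
          for (i = 0; i < #words; i++)
            putln (words [i]);

          var fields = split ("a,b,c", ",");        // explicit regex
          for (i = 0; i < #fields; i++)
            putln (fields [i]);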

Pattern matching

The space re contains predeclared functions which are used for pattern matching. A pattern is described by a regular expression (regex), which is in effect a small program describing how to match a string. Patterns use the default syntax of the ONIGURUMA package for Unicode. It is hard to describe the pattern syntax formally, so the description here is incomplete. For the full reference, please see the ONIGURUMA package documentation. The regular expressions have the following syntax:

          Regex = Branch {"|" Branch}

The regex matches anything that matches one of the branches.

          Branch = {Piece}

The branch matches the first piece, followed by the second piece, etc. If the pieces are omitted, the branch matches the null string.

          Piece = Anchor | Unit

          Unit = Atom
               | Unit Quantifier

          Quantifier = Greedy
                     | Reluctant
                     | Possessive

          Greedy = "?"                 // 0 or 1 times
                 | "*"                 // 0 or more times
                 | "+"                 // 1 or more times
                 | Bound

          Bound = "{" Min "," Max "}"  // from Min to Max times
                | "{" Min "," "}"      // at least Min times
                | "{" "," Max "}"      // equivalent to {0, Max}
                | "{" Min "}"          // exactly Min times

          Reluctant = "??"
                    | "*?"
                    | "+?"
                    | Bound "?"

          Possessive = "?+"
                     | "*+"
                     | "++"

          Min = <unsigned integer>

          Max = <unsigned integer>

A unit followed by * matches a sequence of 0 or more matches of the unit. A unit followed by + matches a sequence of 1 or more matches of the unit. A unit followed by ? matches a sequence of 0 or 1 matches of the unit.

There is a more general construction (a bound) for describing repetitions of a unit. A unit followed by a bound containing only one integer Min matches a sequence of exactly Min matches of the unit. A unit followed by a bound containing one integer Min and a comma matches a sequence of Min or more matches of the unit. A unit followed by a bound containing a comma and one integer Max matches at most Max repetitions of the unit. A unit followed by a bound containing two integers Min and Max matches a sequence of Min through Max (inclusive) matches of the unit.

The quantifiers described above are greedy. A greedy quantifier first matches as much as possible and can back-track to try a shorter sequence if the whole regex fails to match. There are reluctant quantifiers too. They have the additional suffix ? and first match as little as possible. The last type of quantifier is possessive. Such quantifiers have the additional suffix + and behave like the corresponding greedy ones, but they do not back-track.

Examples:

          `.?foo` // matches first "xfoo" in "xfooxxxxfoo"
          `.*foo` // matches all "xfooxxxxfoo"
          `.+foo` // matches all "xfooxxxxfoo"
          `.{1,8}foo` // matches all "xfooxxxxfoo"
          `.*?foo` // matches first "xfoo" in "xfooxxxxfoo"
          `.+?foo` // Ditto
          `.{1,8}?foo` // Ditto
          `.*+foo` // fail to match in "xfooxxxxfoo"
          `.++foo` // fail to match in "xfooxxxxfoo"
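
The difference between the three kinds of quantifiers can be observed from a program. The sketch below assumes a function match (regex, string) in the space re which returns nil when there is no match and otherwise a vector whose first two elements are the start index of the matched substring and the index just past its end; the function descriptions are not reproduced in this section, so this signature is an assumption. The helper function show is hypothetical.

          expose re.*;

          // Report which part of str the regex matches, if any.
          fun show (regex, str) {
            var m = match (regex, str);

            if (m == nil)
              putln (regex, ": no match");
            else
              putln (regex, ": characters ", m [0], " to ", m [1] - 1);
          }

          var s = "xfooxxxxfoo";
          show (`.*foo`, s);   // greedy
          show (`.*?foo`, s);  // reluctant
          show (`.*+foo`, s);  // possessive

Under these assumptions the greedy and the reluctant patterns should report a match of different lengths, and the possessive one should report no match, in line with the examples above.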

          Atom = Anchors
               | Character
               | CharacterType
               | CharacterProperty
               | CharacterClass
               | Group
               | BackReference
               | SubexpCall
         
          Character = "\t"     // horizontal tab (0x09)
                    | "\v"     // vertical tab (0x0B)
                    | "\n"     // newline (0x0A)
                    | "\r"     // return (0x0D)
                    | "\f"     // form feed (0x0C)
                    | "\a"     // bell (0x07)
                    | "\e"     // escape (0x1B)
                    | "\" OctalCode // char with given octal code
                    | "\x" HexCode  // char with given hexadecimal code
                    | <any but special character \ ? * + ^ $ [ ( ) >
                    | "\" <special character>
      
          OctalCode = <3 octal digits>
        
          HexCode = <2 hexadecimal digits>
          
          CharacterType = '.'  // any character but newline
                        | "\w" // Unicode Letter, Mark, Number, or
                               //   Connector_Punctuation
                        | "\W" // opposite to the above 
                        | "\s" // Unicode Line_Separator, 
                               //   Paragraph_Separator, or
                               //   Space_Separator
                        | "\S" // opposite to the above 
                        | "\d" // Unicode decimal number 
                        | "\D" // opposite to the above 
                        | "\h" // hexadecimal digit char [0-9a-fA-F] 
                        | "\H" // opposite to the above 

          CharacterProperty = "\p{" PropertyName "}"
                            | "\p{^" PropertyName "}"
                            | "\P{" PropertyName "}"

         PropertyName = "Alnum" | "Alpha" | "Blank" | "Cntrl"
                      | "Digit" | "Graph" | "Lower" | "Print"
                      | "Punct" | "Space" | "Upper" | "XDigit"
                      | "Word" | "ASCII"
                      | "Any" | "Assigned" | "C" | "Cc" | "Cf"
                      | "Cn" | "Co" | "Cs" | "L" | "Ll" | "Lm"
                      | "Lo" | "Lt" | "Lu" | "M" | "Mc" | "Me"
                      | "Mn" | "N" | "Nd" | "Nl" | "No" | "P"
                      | "Pc" | "Pd" | "Pe" | "Pf" | "Pi" | "Po"
                      | "Ps" | "S" | "Sc" | "Sk" | "Sm" | "So"
                      | "Z" | "Zl" | "Zp" | "Zs" | "Arabic"
                      | "Armenian" | "Bengali" | "Bopomofo"
                      | "Braille" | "Buginese" |  "Buhid"
                      | "Canadian_Aboriginal" | "Cherokee"
                      | "Common" | "Coptic" | "Cypriot"
                      | "Cyrillic" | "Deseret" | "Devanagari"
                      | "Ethiopic" | "Georgian" |  "Glagolitic"
                      | "Gothic" | "Greek" | "Gujarati"
                      | "Gurmukhi" | "Han" | "Hangul" | "Hanunoo"
                      | "Hebrew" | "Hiragana" | "Inherited"
                      | "Kannada" | "Katakana" | "Kharoshthi"
                      | "Khmer" | "Lao" | "Latin" | "Limbu"
                      | "Linear_B" | "Malayalam" | "Mongolian"
                      | "Myanmar" | "New_Tai_Lue" | "Ogham"
                      | "Old_Italic" | "Old_Persian" | "Oriya"
                      | "Osmanya" | "Runic" | "Shavian" | "Sinhala"
                      | "Syloti_Nagri" | "Syriac" | "Tagalog"
                      | "Tagbanwa" | "Tai_Le" | "Tamil" | "Telugu"
                      | "Thaana" | "Thai" | "Tibetan" | "Tifinagh"
                      | "Ugaritic" | "Yi"

          Anchors = "^"           // beginning of the line
                  | "$"           // end of the line
                  | "\b"          // word boundary
                  | "\B"          // not word boundary
                  | "\A"          // beginning of string
                  | "\Z"          // end of string, or before newline
                                  //   at the end
                  | "\z"          // end of string

The atom can be a character. Some characters have a special meaning in a regex (see comments in the character syntax). The remaining characters match themselves in the string being matched. To match a special character, put \ before it. Some characters can be represented by a sequence starting with \ (see the syntax comments).

Examples:

          `\t`        // matches "\t"
          `\x65`      // matches "e"
          `\p{Alpha}` // matches "a"
          `\w`        // matches "a"
          `b$`        // matches "b" in "b\na"

The atom can be an anchor. An anchor matches only if its position corresponds to a specific place in the string being matched (see comments in the anchor syntax).

Examples:

          `b$`        // matches "b" in "b\na"
          `abc\Z`     // matches "abc" in "abc"
          `abc\Z`     // matches "abc" in "abc\n"

The atom which is a character type matches a specific class of character (see comments in the character type syntax).

The atom which is a character property matches a specific class of characters. For the meaning of the names Alnum through ASCII, see the corresponding BracketClass. For the names C through Zs, see the Unicode general categories. For the names Arabic through Yi, see the Unicode scripts (alphabets). If the property is written with \p{^ or with \P, the match succeeds when the character is not of the class.

Examples:

          `\p{Alpha}` // matches "a"
          `\p{ASCII}` // matches ";"

          CharacterClass = "[" Intersections "]"
                         | "[^" Intersections "]"

          Intersections = Set
                        | Intersections "&&" Set

          Set = SetElement
              | Set SetElement
          
          SetElement = ElementChar ["-" ElementChar]
                     | "[:" BracketClass ":]"
                     | "[:^" BracketClass ":]"
                     | CharacterClass

          ElementChar = Character
                      | "\b"       // backspace 0x08

          BracketClass = "alnum"   // Unicode letter, mark,
                                   //   or decimal number
                       | "alpha"   // Unicode letter or mark
                       | "ascii"   // character in range 0 - 0x7f
                       | "blank"   // Unicode space separator
                                   //   or \t (0x09)
                       | "cntrl"   // Unicode control, format,
                                   //   unassigned, private use,
                                   //   or surrogate
                       | "digit"   // Unicode decimal number 
                       | "graph"   // not a space class and not an
                                   //   Unicode control, unassigned,
                                   //   or surrogate
                       | "lower"   // Unicode lower case letter
                       | "print"   // graph or space class
                       | "punct"   // any Unicode punctuation
                       | "space"   // any Unicode separator,
                                   //   \t (0x09), \n (0x0A), \v (0x0B),
                                   //   \f (0x0C), \r (0x0D),
                                   //   or 0x85 (next line)
                       | "upper"   // Unicode upper case letter
                       | "xdigit"  // ascii 0-9, a-f, or A-F
                       | "word"    // Unicode letter, mark, decimal
                                   //   number or punctuation connector

The atom can be a character class (bracket expression), which is a list of character sets separated by && (denoting their intersection) and enclosed in [ and ]. If ^ follows the opening [ directly, the class matches any character which does not match the corresponding class without ^. A set is a sequence of set elements.

An element given by a single character denotes the character itself. An element given by two characters separated by - is shorthand for the full range of characters between those two (inclusive) in the order of their Unicode code points, e.g. [0-9] matches any decimal digit. Besides the usual character representations, you can also use \b here, which denotes a backspace.

An element given by a bracket class enclosed in [: and :] matches a character from this class (see comments in BracketClass). If the character ^ is present right after [:, the match succeeds if the character is not in this class.

An element can itself be a character class; in other words, character classes can be nested.

If you need to use [, -, or ] as a normal character in a character class, prefix it with \.

Examples:

          `[[:alpha:]]`  // matches "a"
          `[[[:lower:]]&&[^a-x]]` // matches "y" or "z"

The atom can be a group, a regular expression enclosed in (). There are several types of groups:

          Group = CapturedGroup
                | NonCapturedGroup
                | "(?#" <any characters but )> ")" // a comment
                | "(?" Options ")"
                | Context
          
          Options =
                  | Options Option

          Option = "-" | "i" | "m" | "x"

          CapturedGroup = "(" [Regex] ")"
                        | "(?<" Name ">" [Regex] ")"
 
          Name = <one or more word character>

          NonCapturedGroup = "(?" Options ":" [Regex] ")"
                           | "(?>" [Regex] ")" // atomic group
                           
          Context = "(?=" [Regex] ")" // look ahead
                  | "(?!" [Regex] ")" // negative look ahead
                  | "(?<=" [Regex] ")" // look behind
                  | "(?<!" [Regex] ")" // negative look behind

          BackReference = "\" Number    // back ref. by group
                                        //   number
                        | "\k<" Number ">" // back ref. by group
                                           //   number
                        | "\k<-" Number ">" // back ref. by relative
                                            //   group number
                        | "\k<" Name ">" // back ref. by group name
                        // back ref. by group name and nest level:
                        | "\k<" Name ("+" | "-") Number ">"

          Number = <any integer >= 0>
                                    

Some groups are captured groups. This means that you can refer to the substrings they match (see the back references) or get the start and end positions of the matched substrings by calling the Dino regex match functions. A captured group may have a name which can be used in back references or in subexp calls.

You can place comments not containing ) in a regex between (?# and ).

Options without a regex always match. They just change how matching works. The option i switches on ignoring letter case during the match. The option m makes . match a newline too. The option x switches on ignoring white space as a character atom and permits comments starting with # and ending at the end of the line. The character - after the corresponding ? has the opposite effect on the options which follow it, e.g. it makes letter case significant in matching again.

You can define options in non-captured groups. These options affect only that group. Another form of non-captured group is an atomic group. Once the regex in an atomic group matches something, the match stays the same during back-tracking.

Examples:

          `(?i:ab)`     // matches "Ab"
          `(?x: a a a)` // matches "aaa"
          `(?>.*)c`     // can not match "abc"

The atom can be a context. A context match does not advance the current position in a matching string. A look ahead context succeeds if the corresponding regex matches a sub-string starting from the current position. A look behind context succeeds if the corresponding regex matches a sub-string finishing right before the current position. There are negative forms of the context atom. They succeed when the corresponding regex does not match.

Examples:

          `(?=bcd)bc`   // matches "bc" in "aabcd"
          `(?<=aa)bc`   // matches "bc" in "aabc"

The atom can be a back reference. It refers to the string matched by the corresponding captured group. The captured groups are numbered by their left parentheses, starting from one and going from left to right. A negative number denotes a relative order number; in other words, the groups are counted starting from the back reference and going from right to left. If the captured group has a name, its matched string can be referenced by that name. If several groups have the same name, the name in the back reference corresponds to the last such group. You can add a nest level to the name. If the nest level is zero, it is the same as a named back reference without a nest level. A back reference with a non-zero nest level never matches.

Examples:

          `(a)\k<1>`     // matches "aa"
          `(?<p>a)\k<p>` // Ditto

The atom can be a subexp call:

          SubexpCall = "\g<" Name ">"

The subexp call is in effect another occurrence of the group it refers to. If the call is inside the group it refers to, the description is recursive. Left recursion is the only form which is not permitted, as it results in never-ending recursion.

Examples:

          `(?<p>cd)\g<p>`   // matches "cdcd"
          `(?<p>a|b\g<p>c)` // matches "a", "bac", "bbacc" etc
          `(?<p>a|\g<p>c)`  // wrong: left recursion

There are the following pattern matching functions in the space re:

If the regular expression is incorrect, the functions generate the exception invregex with a message explaining the error.
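
A program can catch this exception like any other. The sketch below assumes a function match (regex, string) in the space re, which is not described in this section; the unbalanced parenthesis is just one example of an incorrect regular expression.

          expose re.*;

          try {
            match (`(ab`, "ab");  // incorrect regex: unbalanced parenthesis
          } catch (invregex) {
            putln ("the regular expression is incorrect");
          }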

9.5 Space math

The space contains mostly mathematical functions.

Mathematical functions

The following functions make an implicit arithmetic conversion of the parameters. After the conversions the parameters are expected to be of integer, long integer, or floating point type. The result is always a floating point number.

Other space math functions

There are the following miscellaneous functions:

9.6 Space yaep

This space contains declarations to work with Yet Another Earley Parser (YAEP). YAEP is a very powerful tool to implement language compilers, processors, or translators. The implementation of the Earley parser used in Dino has the following features:

Exception classes of space yaep

The space yaep contains the class invparser which is a sub-class of invcall. The following sub-classes of the class invparser describe exceptions specific for the work with YAEP.

Class parser

The space yaep has the predeclared final class parser which implements the Earley parser. The following public functions and variables are declared in the class parser:

The call of the class parser itself can generate the exception pmemory if there is no memory for the internal parser data.

Class token

The space yaep has a predeclared class token. Objects of this class should be the input of the Earley parser (see the function parse in the class parser). The resulting abstract tree representing the translation will have the input tokens as leaves. The class token has one public variable code whose value should be the code of the corresponding terminal described in the grammar. You can extend the class, e.g. by adding variables whose values are attributes of the token (a source line number, the name of an identifier, or the value of a number).

Class anode

The space yaep has a predeclared class anode whose objects are nodes of the abstract tree representing the translation (see the function parse of the class parser). Objects of this class are generated by the Earley parser. The class has two public variables: name, whose value is a string representing the name of the abstract node as it is given in the grammar, and transl, whose value is an array with the abstract node fields as the array elements. There are a few node types which have special meaning:

Variables nil_anode and error_anode

There is only one instance of anode which represents empty (nil) nodes. The same is true for the error nodes. The final variables nil_anode and error_anode refer to these nodes, respectively.

Example of Earley parser usage.

Let us write a program which transforms an expression into postfix (reverse Polish) form. Please read the program comments to understand what the code does. The program should output the string "abcda*+*+", which is the postfix form of the input string "a+b*(c+d*a)".

          expose yaep.*;
          // The following is the expression grammar:
          var grammar = "E : E '+' T   # plus (0 2)\n\
                           | T         # 0\n\
                           | error     # 0\n\
                         T : T '*' F   # mult (0 2)\n\
                           | F         # 0\n\
                         F : 'a'       # 0\n\
                           | 'b'       # 0\n\
                           | 'c'       # 0\n\
                           | 'd'       # 0\n\
                           | '(' E ')' # 1";
          // Create the parser and set up the grammar.
          var p = parser ();
          p.set_grammar (grammar, 1);

          // Add attribute repr to the token:
          class our_token (code) { use token former code; var repr; }
          // The following code forms input tokens from the string:
          var str = "a+b*(c+d*a)";
          var i, inp = [#str : nil];
          for (i = 0; i < #str; i++) {
            inp [i] = our_token (str[i] + 0);
            inp [i].repr = str[i];
          }
          // The following function outputs messages about the syntax errors
          // and the syntax error recovery:
          fun error (err_start, err_tok,
                      start_ignored_num, start_ignored_tok_attr,
                      start_recovered_num, start_recovered_tok) {
            put ("syntax error on token #", err_start,
                 " (" @ err_tok.code @ ")");
            putln (" -- ignore ", start_recovered_num - start_ignored_num,
                   " tokens starting with token #", start_ignored_num);
          }

          var root = p.parse (inp, error); // parse

          // Output the translation in postfix (reverse Polish) form
          fun pr (r) {
            var i, n = r.name;

            if (n == "$term")
              put (r.transl.repr);
            else if (n == "mult" || n == "plus") {
              for (i = 0; i < #r.transl; i++)
                pr (r.transl [i]);
              put (n == "mult" ? "*" : "+");
            }
            else if (n != "$error") {
              putln ("internal error");
              exit (1);
            }
          }

          pr (root);
          putln ();

