CS340400 Compiler Design Homework 1

CS340400 Compiler Design Homework 1
4/12 (Wed) 23:59

Alfred Vaino Aho and Jeffrey David Ullman
Turing Award Winner of 2020
for fundamental algorithms and theory underlying programming language implementation

The Structure of a Modern Compiler

What are we Going to do?
source code a = b + c * d
tokens id = id + id * id
Lexical Analyzer
Syntax Analyzer
syntax tree =
generated code mul $r1, $r2, $r3 add 5$r0, $r1, $r4 …
Code Generator

The Structure of Compiler
source code a = b + c * d
tokens id = id + id * id
Lexical Analyzer
Syntax Analyzer
syntax tree =
generated code mul $r1, $r2, $r3 add 6$r0, $r1, $r4 …
Code Generator

程序代写 CS代考加微信: cstutorcs
– lexical analyser generator

What is Lex?
• Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem oriented specification for character string matching, and produces a program in a general purpose language which recognizes regular expressions.
• The regular expressions are specified by the user in the source specifications given to Lex.
• Lex generates a deterministic finite automaton (DFA) from the regular expressions in the source.
• The Lex written code recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions.
From http://dinosaur.compilertools.net/lex/ 8

Input Stream
If (matches the DFA expression) Do Something…….

Lex with Yacc
Lex source (Lexical Rules)
Yacc source HW2 (Grammar Rules)
lex.yy.c call return token
Parsed Input

How to Write Lex?
Definition Section
Rules Section %%
C Code Section
(required) (required)
(optional)

How to Write Lex?
#include
int lineCount=0; %}
Definition Section
Rules Section %%
C Code Section
The Definition Section will be copied to the top of generated C program. Include header files, declare variables.

How to Write Lex?
\n { lineCount++; printf(“line:%d\n”, lineCount); }
Definition Section
Rules Section
C Code Section
The Rules Section is for writing regular expression to recognize tokens. When pattern is matched, then execute action
[Regular expression rule] { The things you want to do; }
程序代写 CS代考加QQ: 749389476
How to Write Lex?
int main(void) { yylex();
int yywrap() { return 1;
// Other function you defined.
Definition Section
Rules Section %%
C Code Section
The C Code Section will be copied to the bottom of generated C program.

How to Write Lex?
A completed Lex program
#include
int lineCount=0; %}
\n { lineCount++;
printf(“line:%d\n”, lineCount); }
int main(void){
return 0; } ……
Definition Section
Rules Section %%
C Code Section

Lex program scanner.l
C program test.c
Compilation Flow
We use Flex (Fast Lex) instead of Lex. $ flex scanner.l
$ gcc -o scanner lex.yy.c -lfl
GCC compiler
$ ./scanner < test.c Flex: the Fast Lexical Analyser Generator Link with library libfl.a From: http://oz.nthu.edu.tw/~d947207/chap10_lex.ppt Flex Example: Count Number of Lines and Number of Characters count_line.l Press Enter Press Ctrl+D 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 T h i s i s a b o o k \n b y e b y e \n Generate source C-code lex.yy.c Library libfl.a From: http://oz.nthu.edu.tw/~d947207/chap10_lex.ppt Grammar of Input file of Flex Lex copy data enclosed by %{ and %} into C source file { ++num_lines ; ++ num_chars ; } pattern \n character expect line feed \n grammar of input file { ++ num_chars ; } wild card character, represent any definition section rule section pattern action When pattern is matched, then execute action 19 Can we Compile lex.yy.c without –lfl ? We want to use lex.yy.c on different platforms (Linux and windows), to avoid specific library is lesson one. Library libfl.a contains function yywrap() -lfl means “include library libfl.a”, this library locates in /usr/lib contains function main() and yywrap() From: http://oz.nthu.edu.tw/~d947207/chap10_lex.ppt 20 From: http://oz.nthu.edu.tw/~d947207/chap10_lex.ppt Can we Compile lex.yy.c without –lfl ? count_line.l Implement function yywrap explicitly How to Process a file? count_line.l lex.yy.c yyin is a file pointer in lex, function yylex() read characters from yyin From: http://oz.nthu.edu.tw/~d947207/chap10_lex.ppt 22 Lex Predefined Variables char *yytext Pointer to matched string. int yyleng Length of matched string. int yylex(void) Function call to invoke lexer and return token. int yywrap(void) Return 1 if no more files to be read. char *yymore(void) Return the next token. int yyless(int n) Retain the first n characters in yytext and (sort of) return the rest back to the input stream. FILE *yyin Input stream pointer. FILE *yyout Output stream pointer. Print out the yytext. Condition switch. Go to the next alternative rule. man flex or info flex to get more info Regular Expressions Any character excepts ‘\n’. . = {a, b, c, d, ......} Zero or more. ab* = {a, ab, abb, abbb, ......} One or more. ab+ = {ab, abb, abbb, ......} Zero or one. a? = {ԑ, a} a|b = {a, b} Any character of the character set. [abc] = {a, b, c} To group characters. (ab)* = {ԑ, ab, abab, ......} For escape character. \* = {*}, \\ = {\} Literally. "a*" = {a*} Repeat n to N times. a{1,3} = {a, aa, aaa} Not these characters. (Opposite of []) [^abc] = {d, e, f, ......} Head/End of line. ^a = a... // line starts with a Followed by specific character.24 a/b = {ab} // but only returns a Regular Expressions Input String she { printf(“she\t”); } [sS]he { printf(“another she\t”); } he { printf(“he\t”); } s { printf(“s\t”); } Rule Section The output result is "she": 1. Choose the longest matching pattern. If there's multiple matches of the same length, then 2. Choose the first matching pattern from top to bottom More Elegant Way to Write Regular Expressions #include
int lineCount=0; %}
\n { lineCount++;
printf(“line:%d\n”, lineCount) ;}
{ch}+ { ECHO; }

More Elegant Way to Write Regular Expressions
#include
int lineCount=0; %}
\n { lineCount++;
printf(“line:%d\n”, lineCount); } [[:alpha:]]+ { ECHO; }

Regular Expressions
Regular Expression
Any character of a ~ z and A ~ Z.
Any character of 0 ~ 9.
[[:lower:]] = [a-z]
[[:upper:]] = [A-Z]
[[:alpha:]] = [a-zA-Z]
[[:digit:]] = [0-9]
[[:alnum:]] = [a-zA-Z0-9]

Start Condition
• What if you encounter the string like this? Input
Input String
/* int count
is for counting line number */
printf( “int is 32-bit” );

%x COMMENT %%
Start Condition
• Declare at Definition Section • %s STATE_NAME – inclusive
– If the start condition is inclusive, then rules with no start conditions at all will also be active.
• %x STATE_NAME – exclusive
– If it is exclusive, then only rules qualified with the start condition will be active.

Input String
Start Condition
%x COMMENT /* Exclusive */
“/*” { BEGIN COMMENT; } int { printf(“normal\n”);} int { printf(“special\n”);
BEGIN 0; }

Input String
Start Condition
%s COMMENT /* Inclusive */
“/*” { BEGIN COMMENT; } int { printf(“normal\n”);} int { printf(“special\n”);
BEGIN 0; }

Versions of Lex
• AT&T:lex http://www.combo.org/lex_yacc_page/lex.html
• GNU:flex http://www.gnu.org/manual/flex-2.5.4/flex.html
• a Win32 version of flex http://www.monmouth.com/~wstreett/lex-yacc/lex-yacc.html
http://sources.redhat.com/cygwin/
• Lex on different machines is not created equal. 33

Homework1 – Requirements

Subset of C Language
• Character Set of Testcases • ASCII characters
• Only those in the right image • ‘\n’ and ‘\t’
• Maximum Length of each Line: 299

Subset of C Language
• Implement: Keywords
• for, do, while, break, continue, if, else, return, struct, switch, case,
• void, int, double, float, char
• const, signed, unsigned, short, long
• Implement: Macros
• NULL, __COUNTER__, __LINE__, INT_MAX, INT_MIN, CHAR_MAX, CHAR_MIN, MAX, MIN

Subset of C Language
• Implement: Identifiers (case-sensitive)
• Follow the standard C variable naming rule: A variable name can only have letters (both uppercase and lowercase letters), digits and underscore. The first letter of a variable should be either a letter or an underscore.
• Implement: Operators
• + – * / % ++ — < <= > >= == != = && || ! & |
• Implement: Punctuation characters •:;,.[](){}

Subset of C Language
• Implement:
• Integer constants (e.g. 0, -0, 1, 123, 45, -2131)
• There can be a deliberate ‘+’ preceding positive numbers – Note: “- 0” is “-” (op) and “0” (integer), while “-0” is an integer
• Simple form floating point constants (e.g. 0.0, 0.1234, 123.456, -0.0, – 0.1234)
• There can be a deliberate ‘+’ preceding positive numbers – Note: “- 0.” is “-” (op) and “0.” (double), while “-0.” is a double
• The number, if it’s 0, before or after the decimal point can be missing 38

Subset of C Language
• Implement:
• String constants (e.g. “This is a string”)
• Take particular note of escaped sequences ()
• Character constants (e.g. ‘a’, ‘b’, ‘\t’, ‘\0’)
• Also take particular note of escaped sequences ()
• C Comments
• both the // … and /* … */ syntax

Subset of C Language
• Implement: Pragma directives • #pragma source on
• #pragma source off
• #pragma token on
• #pragma token off • Note
• Pragmas could have spaces and/or ‘\t’ on the same line and between words in them, but not between the hash (‘#’) and “pragma”
• Pragmas should be on their own line 40

Subset of C Language
• Always parse with the rule that matches the longest input
• E.g. “ab123” is an Identifier (“ab123”, length 5), not Identifier (“ab”, length 2) and Integer (“123”, length 3)

Output Format
Token type
• Keyword (key): Refer to slide page 35
• Macro (macro): Refer to slide page 35
• Identifier (id): Refer to slide page 36
• Operator (op): Refer to slide page 36
• Punctuation Character (punc): Refer to slide page 36
• Integer (integer): Refer to slide page 37
• Floating Point (float): Refer to slide page 37
• Char (char): Refer to slide page 38
• String (string): Refer to slide page 38
One must print the token types with the type names designated in the parentheses once a token of these types is encountered.

Output Format
• One must print the result in this format • For each line of input
• If the extracted token is a pragma directive or part of a comment, print nothing except the line information (see below)
• Otherwise, print “#” first, then the token type and token content (`#${token_type}:${token_content}`)
• Quotes of strings and characters should be retained
• Finally, print the line number and content at the end of each input line (`${line_number}:${line_content}`)
• #token_type1:token_content1 • #token_type2:token_content2 • …
• line_number:line_content

Output Format Examples: Testcase0 Line1
char a = ‘i’;
Output Format Examples: Result
#key:char #id:a
#op:= #char:’i’ #punc:; 1:char a = ‘i’;
Please use diff command to check your output against the golden scanner!

Pragma Directives: Source
#pragma source on char a = ‘i’;
#pragma source off char a = ‘i’;
1:#pragma source on #key:char
#char:’i’ #punc:; 2:char a = ‘i’;
#key:char #id:a #op:= #char:’i’ #punc:;

Pragma Directives: Token
#pragma token on char a = ‘i’;
#pragma token off char a = ‘i’;
1:#pragma token on
#key:char #id:a
#char:’i’ #punc:;
2:char a = ‘i’;
1:#pragma token off
2:char a = ‘i’;

Grading Policies
• For all homeworks
• Any warning during compilation: -20 points penalty
• Late Submission: -10 points penalty/24H
• Applied at the last step after all other penalties
• Executable, but not complying to specifications: 20% off your original score if you
apply for a manual review (Reviews are not guaranteed to be accepted.) • Includingbuildscriptfailures
• Non-executable: A flat 30 point if you turn in your source code and a report
• Refer to later slides for report requirements
• Plagiarism/Cheating: A flat 0 point even if you do everything you can do 47

Grading Policies
• If your scanner behaves correctly on the following type: • Keywords && Identifiers: +10
• Macros: +10
• Operators && Punctuations: +15
• Integers && Floating Points && Characters: +15 • Strings: +15
• Comments: +15
• Pragma Directives: +10
• ComprehensiveTestcases:+10
• A sample testcase is released for reference
• Use `./scanner < sample_testcase.txt` to test your scanner 48 Code Help, Add WeChat: cstutorcs • For students who cannot finish his homework, he can turn in a report answering ALL of the following topics: • Describe Lex, including its features and possible purposes • Describe the problems you faced when implementing your scanner • Describe your guess/understanding on the nature of those problem, and what you did to solve them • For those who successfully passes at least 1 testcase, no report is required Submission • HW1 Lex Implementation: Submit on Server • Create a `hw1` directory in your home, and provide in it: • A lex source code file named `scanner.l`. E.g. `/home/104062634/hw1/scanner.l` • A makefile to compile your code. It should only refer to resources in your `hw1` directory using relative paths. E.g. `/home/104062634/hw1/makefile` • The compiled output must be named `scanner` with the executable bit set • Makesureyourprogramworkscorrectlyintheserverenvironment • Report + Project Source Code: eeclass (if you can't finish your implementation) • Upload your report (PDF) and source code (ZIP) to eeclass 50 How to Connect to our Server? • SSH Protocol • IP: 140.114.88.201 • Port: 8787 • Username: (Will be released) • Default Password: (Will be released) • One can change password by entering `passwd` • Windows: PuTTY, MobaXterm, ... • Linux, Mac OS: Built-in ssh Linux Materials • Linux Command • https://linux.vbird.org/linux_basic/centos7/0210filepermission.php • https://linux.vbird.org/linux_basic/centos7/0310vi.php • Shell Script • https://linux.vbird.org/linux_basic/centos7/0340bashshell-scripts.php • Makefile • http://www.cprogramming.com/tutorial/makefiles.html • http://jimmynuts.blogspot.tw/2010/12/gnu-makefile.html • lex & yacc • by John R.Levine, Tony Mason & Doug Brown • O’Reilly • ISBN:1-56592-000-7 • Mastering Regular Expressions • by Jeffrey E.F. Friedl • O’Reilly • ISBN:1-56592-257-3 Contact Us