Generate Text with Probabilities

Background

Some of you may be familiar with the “chimpanzees and typewriters” story. Imagine aroom full of chimpanzees, each sitting before a typewriter, diligently pecking away at thekeys in a completely random manner. In principle, at least, if we were prepared to waitlong enough, it is not beyond the realm of possibility that eventually we might find thecomplete works of Shakespeare among mountains of garbage that might have beenproduced. In the electronic age, we can easily simulate the performance of achimpanzee, by using a computer actually generating a valid English word in this way ina reasonable amount of time is slim, never mind a weighty piece of text.

A better approach in generating random English text would be to compute thefrequencies of the letters between A and Z and the blank character, in typical English text. By choosing letters withprobability equal to their frequencies, we might hope to generate text that more closelyapproximates English than the chimpanzee approach. The travesty game generalizesthis idea even further, to become a form of computerized solitaire. The game datesback to work done in the late 1940’s by Claude Shannon, and has been the subject ofseveral amusing articles. Also recently, applications of travesty have also beenexplored, in areas such as coding of text, and in word-processing.

Your assignment is to program the travesty game, which is described in some detail inthe next section. The assignment is designed to refresh your memory on arrays,characters, andprocedural programming; and get you familiarized with the Linuxsystem on hills. If you apply the principles of data abstraction in this implementation,you will be able to re-use much of your code in the later version.

Description of the Travesty Game

A selection of English text is given as input. Instead of computing the frequencies ofsingle characters from the input text, we compute for each prefix of some fixed length,the frequency of letters immediately following that prefix. That is, we set a constantcalled prefix_size, and for each prefix of length prefix_sizethat occurs in thetext, and each character suffix s, we compute the probability that s immediately followsthe prefix in the text. This is the frequency of s with respect to the prefix.

For example, if the prefix_sizewere 1, we would have 27 prefixes, one for eachletter and one for the blank character. With respect to the prefix Q, the frequency of theletter U would be 1 (or very close), while all other letters would have frequency zero (orvery close), reflecting the fact that the prefix Q is very likely to be followed by U. Asanother example, if the prefix_sizewere 2, we would have at most (27 * 27)prefixes. With respect to the prefix TH, we would expect the frequency of the letter E tobe high and the frequency of the letter Z to be zero.

Having compute the frequencies, the output text is constructed as follows:
1. choose a prefix randomly, with probability equal to its frequency and print itCS110B programming assignment#1: The Travesty Game due on Oct. 19, 2017! 1

  1. w.r.t. prefix, choose a suffix character s with probability equal to its frequency andprint s
    3. make a new prefix by removing the leftmost character from prefix and appending s tothe right end of prefix
    4. repeat with the new prefix from step 2. until the number of letters print equals a predefined constant called output_size

Specification of the Travesty Data Structure

The elements of the travesty-structure are simply pairs (prefix,s), where prefix is a userdefined type prefix_typeas array of characters (or string) and s is a character calledthe suffix. The user-defined type prefix_typeas array of characters (or string)consists only of upper-case letters and blanks while the suffix is a single upper-caseletter or a blank. For simplicity, the prefix_typehas been resolved into a singlecharacter in this assignment. We defined five operations for the travesty-structure.

void initialize_travesty();
results: the travesty structure is initialized (or created)
void update_travesty(char prefix, char s);
requires: initialize_travesty() is already executed
results: the pair (prefix, s) is added to travesty structure
char choose_prefix();
requires: update_travesty() is already executed
results: prefix is chosen with probability equal to its frequency
char choose_suffix(char prefix);
results: suffix s is chosen with probability equal to its frequency w.r.t. prefix. If there is
no pair of the form (prefix,-) in travesty structure, then s is chosen randomly and
uniformly from the set {‘A’, … , ‘Z, ‘ ‘}
void print_travesty();
results: the frequencies of prefixes and suffixes in the travesty structure is printed out in
a useful way for debugging purposes

travesty.cpp.rtf 

/*

* program description:

* anenglish text is given as input.  the program will read in

* the text and store in a data structure which is called travesty

* structure.  it consists of an array of prefix table.  the array

* is index by a fixed length prefix.  the prefix table has two fields.

* one is the total number of a particular prefix that appear in the text, while

* the other field is the array of suffix count which is index by a single

* character suffix from ‘A’ to ‘Z’ and ‘@’.

*

* the first part of the program will fill the travesty structure by reading

* one (prefix, s) pair at a time from the input english text.

*

* the second part of the program will print out the travesty structure based on

* the frequency of a prefix or suffix that appear in the original text.

*

* the third part of the program is to print out the random text.  i use a

* “line_length” constant to determine the output line length.

*

*

* assumptions:

* 1. the input text must be in CAPITAL letters and only consists of

*    ‘A’ to ‘Z’ and ‘ ‘.

*

* 2. the file size must be greater than the prefix size

*

* 3. all the newline marks are treated as blanks

*

* 4. the end of file mark indicates the end of file and it won’t be

*    treated as a character.  Therefore, the last valid character of the file

*    will be treated as suffix only

*

*/

#include <cstdlib> // srand() and rand()

#include <ctime> // time()

//********************

// global declarations

//********************

// constants

const unsigned OUTPUT_SIZE = 1000;

const unsigned LINE_LENGTH = 80;

const unsigned PREFIX_SIZE = 1;

const unsigned NUM_TABLE_INDEX = 27;

// travesty structure

chartable_index[NUM_TABLE_INDEX] =  {‘@’,

‘A’, ‘B’, ‘C’, ‘D’, ‘E’, ‘F’, ‘G’, ‘H’, ‘I’, ‘J’, ‘K’, ‘L’, ‘M’,

‘N’, ‘O’, ‘P’, ‘Q’, ‘R’, ‘S’, ‘T’, ‘U’, ‘V’, ‘W’, ‘X’, ‘Y’, ‘Z’};

unsignedprefix_count[NUM_TABLE_INDEX];  // count for each prefix

unsignedsuffix_count_wrt_prefix[NUM_TABLE_INDEX][NUM_TABLE_INDEX]; // 2D array of suffix count w.r.t. prefix

unsignedtotal_prefix_count;   // total number of prefix

//********************

// function prototypes

//********************

// random returns a number randomly and uniformly between 1 and limit

unsigned random(unsigned);

// results: the travesty-structure is initialized (or created)

voidinitialize_travesty();

// requires: initial_travesty() is already executed

// results: the pair (prefix , s) is added to travesty-structure

voidupdate_travesty(char prefix, char s);

// requires: update_travesty() is already executed

// results: prefix is chosen with probability equal to its frequency

charchoose_prefix();

// results: suffix s is chosen with proability equal to its frequency w.r.t. prefix

charchoose_suffix(char prefix);

// results: the frequencies of prefixes and suffixes in the travesty-structure is printed out in a useful way

voidprint_travesty();

//*************

// main program

//*************

int main() {

unsigned seed = time(0); // get the system time

srand(seed); // seed the random number generator

// put your code here

return 0;

}

//*********************

// function definitions

//*********************

unsigned random(unsigned limit) {

return (rand() % limit + 1);

}

charchoose_prefix() {

unsigned r, accum;

charcurr_table_index;

r = random(total_prefix_count); // pick a number r that is randomly and uniformly distributed between 1 and total_prefix_count

accum = 0;

curr_table_index = ‘@’-1; // the character, ‘?’,  precedes ‘@’

while (accum< r) {

curr_table_index++;  // the succeed character

accum += prefix_count[curr_table_index-‘@’]; // accumulate the frequency of each prefix

}

return ((curr_table_index == ‘@’) ? ‘ ‘ :curr_table_index);  // return the actual character

} 

Solution

 travesty.cpp

/*

*

* program description:

* anenglish text is given as input.  the program will read in

* the text and store in a data structure which is called travesty

* structure.  it consists of an array of prefix table.  the array

* is index by a fixed length prefix.  the prefix table has two fields.

* one is the total number of a particular prefix that appear in the text, while

* the other field is the array of suffix count which is index by a single

* character suffix from ‘A’ to ‘Z’ and ‘@’.

*

* the first part of the program will fill the travesty structure by reading

* one (prefix, s) pair at a time from the input english text.

*

* the second part of the program will print out the travesty structure based on

* the frequency of a prefix or suffix that appear in the original text.

*

* the third part of the program is to print out the random text.  i use a

* “line_length” constant to determine the output line length.

*

*

* assumptions:

* 1. the input text must be in CAPITAL letters and only consists of

*    ‘A’ to ‘Z’ and ‘ ‘.

*

* 2. the file size must be greater than the prefix size

*

* 3. all the newline marks are treated as blanks

*

* 4. the end of file mark indicates the end of file and it won’t be

*    treated as a character.  Therefore, the last valid character of the file

*    will be treated as suffix only

*

*/

#include <cstdlib> // srand() and rand()

#include <ctime> // time()

#include <cctype> // toupper()

#include <cstdio> // printf() to easy format the table

#include <fstream>  // ifstream

usingstd::ifstream;

//********************

// global declarations

//********************

// constants

const unsigned OUTPUT_SIZE = 1000;

const unsigned LINE_LENGTH = 80;

const unsigned PREFIX_SIZE = 1;

const unsigned NUM_TABLE_INDEX = 27;

// travesty structure

chartable_index[NUM_TABLE_INDEX] =  {‘@’,

‘A’, ‘B’, ‘C’, ‘D’, ‘E’, ‘F’, ‘G’, ‘H’, ‘I’, ‘J’, ‘K’, ‘L’, ‘M’,

‘N’, ‘O’, ‘P’, ‘Q’, ‘R’, ‘S’, ‘T’, ‘U’, ‘V’, ‘W’, ‘X’, ‘Y’, ‘Z’};

unsignedprefix_count[NUM_TABLE_INDEX];  // count for each prefix

unsignedsuffix_count_wrt_prefix[NUM_TABLE_INDEX][NUM_TABLE_INDEX]; // 2D array of suffix count w.r.t. prefix

unsignedtotal_prefix_count;   // total number of prefix

//********************

// function prototypes

//********************

// random returns a number randomly and uniformly between 1 and limit

unsigned random(unsigned);

// results: the travesty-structure is initialized (or created)

voidinitialize_travesty();

// requires: initial_travesty() is already executed

// results: the pair (prefix , s) is added to travesty-structure

voidupdate_travesty(char prefix, char s);

// requires: update_travesty() is already executed

// results: prefix is chosen with probability equal to its frequency

charchoose_prefix();

// results: suffix s is chosen with proability equal to its frequency w.r.t. prefix

charchoose_suffix(char prefix);

// results: the frequencies of prefixes and suffixes in the travesty-structure is printed out in a useful way

voidprint_travesty();

//*************

// main program

//*************

int main() {

ifstream in;  // input file

unsigned i;

unsignedlen; // when output, len is the number of characters in current line.

char c1, c2;  // store two continuous characters in the file

unsigned seed = time(0); // get the system time

srand(seed); // seed the random number generator

// initialize the data structure

initialize_travesty();

// open the input file and read character one by one

// to feed the data structure

in.open(“data”);

if (in.fail()) {

printf(“does not find data file in the current folder.\n”);

return -1;

}

if (in.get(c1)) {

while (in.get(c2)) {

// update the pair (c1, c2) in the data structure

update_travesty(c1, c2);

c1 = c2;

}

}

in.close();

// for debug purpose, print the table

print_travesty();

printf(“\n\nOUTPUT :\n\n”);

// now generate the output

if (OUTPUT_SIZE > 0) {

c1 = choose_prefix();

len = 0;

for (i = 0; i < OUTPUT_SIZE; i++) {

printf(“%c”, c1);

len ++; // increase the length of current line

if (len == LINE_LENGTH) { // generate a new line

len = 0;

printf(“\n”);

}

c1 = choose_suffix(c1); // generate next character

}

}

return 0;

}

//*********************

// function definitions

//*********************

unsigned random(unsigned limit) {

return (rand() % limit + 1);

}

charchoose_prefix() {

unsigned r, accum;

charcurr_table_index;

r = random(total_prefix_count); // pick a number r that is randomly and uniformly distributed between 1 and total_prefix_count

accum = 0;

curr_table_index = ‘@’-1; // the character, ‘?’,  precedes ‘@’

while (accum< r) {

curr_table_index++;  // the succeed character

accum += prefix_count[curr_table_index-‘@’]; // accumulate the frequency of each prefix

}

return ((curr_table_index == ‘@’) ? ‘ ‘ :curr_table_index);  // return the actual character

}

voidinitialize_travesty() {

unsignedint i, j;

// initialize all the counts of prefix and suffix to 0.

for (i = 0; i < NUM_TABLE_INDEX; i++) {

prefix_count[i] = 0;

for (j = 0; j < NUM_TABLE_INDEX; j++) {

suffix_count_wrt_prefix[i][j] = 0;

}

}

total_prefix_count = 0;

}

voidupdate_travesty(char prefix, char s) {

unsigned i, j; // index of prefix and s in the table

// replace any space with @ (internal symbol for spaces)

if (isspace(prefix)) {

prefix = ‘@’;

}

if (isspace(s)) {

s = ‘@’;

}

// calculate the index of prefix and s in the table

i = toupper(prefix) – ‘@’;

j = toupper(s) – ‘@’;

// increase the counts

prefix_count[i] ++;

suffix_count_wrt_prefix[i][j] ++;

total_prefix_count ++;

}

charchoose_suffix(char prefix) {

unsigned i, j, r, accum, curr_table_index;

unsignedtotal_suffix_count;

// replace any spaces with @ (internal symbol for spaces)

if (isspace(prefix)) {

prefix = ‘@’;

}

// calculate the total number of suffix for the given prefix

i = toupper(prefix) – ‘@’;

total_suffix_count = 0;

for (j = 0; j < NUM_TABLE_INDEX; j++) {

total_suffix_count += suffix_count_wrt_prefix[i][j];

}

// the following logic is similar to the code in choose_prefix

// the difference is here we choose a suffix in the array suffix_count_wrt_prefix[i].

//

r = random(total_suffix_count); // pick a number r that is randomly and uniformly distributed between 1 and total_suffix_count

accum = 0;

curr_table_index = ‘@’-1; // the character, ‘?’,  precedes ‘@’

while (accum< r) {

curr_table_index++;  // the succeed character

accum += suffix_count_wrt_prefix[i][curr_table_index-‘@’]; // accumulate the frequency of each prefix

}

return ((curr_table_index == ‘@’) ? ‘ ‘ :curr_table_index);

}

voidprint_travesty() {

unsigned i, j;

// print the total header

printf(“%5s %5s”, “CNT”, “PFX”);

for (i = 0; i < NUM_TABLE_INDEX; i++) {

printf(” %5c”, table_index[i]);

}

printf(“\n”);

// for each prefix, print the count of each suffix

for (i = 0; i < NUM_TABLE_INDEX; i++) {

printf(“%5d %5c”, prefix_count[i], table_index[i]);

for (j = 0; j < NUM_TABLE_INDEX; j++) {

printf(” %5d”, suffix_count_wrt_prefix[i][j]);

}

printf(“\n”);

}

// print the total number of prefix

printf(“total number of ALL prefixes count is : %d\n”, total_prefix_count);

}