This course has already ended.

The latest instance of the course can be found at: O1: 2024

Luet oppimateriaalin englanninkielistä versiota. Mainitsit kuitenkin taustakyselyssä osaavasi suomea. Siksi suosittelemme, että käytät suomenkielistä versiota, joka on testatumpi ja hieman laajempi ja muutenkin mukava.

Suomenkielinen materiaali kyllä esittelee englanninkielisetkin termit. Myös suomenkielisessä materiaalissa käytetään ohjelmaprojektien koodissa englanninkielisiä nimiä kurssin alkupään johdantoesimerkkejä lukuunottamatta.

Voit vaihtaa kieltä A+:n valikon yläreunassa olevasta painikkeesta. Tai tästä: Vaihda suomeksi.


Chapter 7.1: Streams

About This Page

Questions Answered: How do I write a program that reads in an unspecified amount of input or repeats some other operation a initially unknown number of times? How can I operate on a bunch of data one at a time without first storing all of it in memory?

Topics: Finite and infinite streams. By-name parameters.

What Will I Do? Read and program.

Rough Estimate of Workload:? Two or three hours? This is a fairly long chapter, but much of it is optional reading.

Points Available: A210. (There isn’t all that much to do, but the points value is high because the assignments are intended as near-mandatory.)

Related Projects: Sentiments (new), HigherOrder.

../_images/person04.png

Introduction

As a warm-up for the main subject of this chapter, streams, let’s consider another topic.

Suppose we intend to create a function fiveTimes. This function should return a five-element vector where each element is the value of a given Int expression:

fiveTimes(100)res0: Vector[Int] = Vector(100, 100, 100, 100, 100)
fiveTimes(1 + 1)res1: Vector[Int] = Vector(2, 2, 2, 2, 2)

We really want five separate evaluations of the same given expression. If different evaluations yield different results, that should be reflected in the output:

import scala.util.Randomimport scala.util.Random
fiveTimes(Random.nextInt(100))res2: Vector[Int] = Vector(43, 55, 21, 46, 87)
fiveTimes(Random.nextInt(100))res3: Vector[Int] = Vector(33, 65, 62, 31, 73)

Here’s an implementation for fiveTimes. Assess it.

def fiveTimes(numberExpression: Int) =
  Vector.tabulate(5)( anyIndex => numberExpression )

Which of the following claims are correct? Select all that apply.

Chapter 4.3 introduced you to the Option class and its method getOrElse.

Some(100).getOrElse(0)res4: Int = 100
None.getOrElse(0)res5: Int = 0
vectorOfWords.lift(-123).getOrElse("no word at that index")res6: String = "no word at that index"

Above, we used a literal as getOrElse’s parameter, but we could have used a compound expression as well:

None.getOrElse(1 + 1)res7: Int = 2
vectorOfNumbers.lift(-5).getOrElse(Random.nextInt(10))res8: Int = 8

There are a couple more expressions below. Work out how they behave.

Some(123).getOrElse(1 / 0)
Some(123).getOrElse(Random.nextInt(100))

Which of the following is correct?

Passing Unevaluated Parameters

Here’s a version of fiveTimes that works. It’s nearly identical to our earlier attempt:

def fiveTimes(numberExpression: =>Int) =
  Vector.tabulate(5)( anyIndex => numberExpression )
There is a lone arrow in front of the parameter type. It means that this is an unevaluated parameter, better known as a by-name parameter.

As we just re-established, ordinary parameters — also known as by-value parameters — are evaluated before the function call begins. What gets passed in is a value. In contrast, a by-name parameter is not evaluated before the function call begins: what gets passed in is an unevaluated expression. That parameter expression will be evaluated only when the function body uses it (if in fact it does use it). That is, the parameter is evaluated while the function is already running.

In case a function uses a by-name parameter multiple times, the parameter gets evaluated multiple times:

def fiveTimes(numberExpression: =>Int) =
  Vector.tabulate(5)( anyIndex => numberExpression )
tabulate forms a five-element vector, placing an Int at each of the indices from 0 to 4. Each element is obtained by evaluating numberExpression. If numberExpression is, say, Random.nextInt(100), five distinct random numbers will be placed in the vector.

You can think of by-name parameters as a sort of parameterless function that is passed to another function as a parameter.

Here’s an illustrated example of a by-name parameter:

In O1, you aren’t required to define methods that take by-name parameters. However, you will need to use such methods. There’s nothing much to it, and indeed you have already done so, since getOrElse on Options takes a by-name parameter. getOrElse has been defined to either a) return the contents of the Option wrapper and leave the parameter untouched, or b) evaluate the parameter expression and return the result if there’s nothing in the Option. This is why the method works as discussed above.

Another familiar example

In Chapter 5.1 you saw that the subexpression to the right of the logical operators && and || is evaluated only in case the subexpression on the left isn’t enough to decide the value of the logical expression. You also know that each of those operators is actually a method on Boolean objects (Chapter 5.2); the “subexpression to the right” is actually a by-name parameter of that method.

More jargon

If a parameter expression is evaluated at least once, the parameter is termed strict (tiukka). The parameter of fiveTimes is strict; so are all by-value parameters. If there is a possibility that a parameter does not get evaluated at all, it’s termed non-strict (väljä); getOrElse’s parameter is one example.

A+ presents the exercise submission form here.

Now to this chapter’s main subject.

Challenge: An Unknown Number of Repetitions

So far in O1, we have repeated commands by gathering all the necessary data in a collection, then working through that collection one element at a time. The number of repetitions has equalled the size of the collection, which has been known to our programs by the time we start traversing the collection. For example, we have often collected data in a vector, which is wholly stored in the computer’s memory; we have then used a for loop or a higher-order method to do something with some or all of the vector’s elements.

In contrast, consider these common scenarios:

  • We intend to read inputs from the user for the program to process until the user indicates they wish to stop. The program doesn’t know in advance how many inputs the user will enter before stopping; the user may not know, either.
  • We intend to process data from a source (a file, a network, a physical sensor governed by the program, etc.), processinga single element at a time. We don’t know the number of elements before starting and there could be so many of them that we don’t know if it’s possible to store all of them at once in the computer’s memory.
  • We intend to compute increasingly precise approximations of a desired quantity by repeatedly applying a particular computation; Newton’s method for finding roots is an example. We need an iterative process that applies the computation to the previous result until we reach sufficient precision. We don’t know in advance how many iterations it will take before that happens.

More generally still: we wish to repeat an operation an initially unknown number of times. In principle, there is no limit to the number of repetions; we might have a source that continuously generates more data, for instance.

A more concrete example

Let’s write a toy program that reports the lengths of four strings. There is nothing new here yet.

object LengthReport extends App {
  def report(input: String) = "The input is " + input.length + " characters long."
  def lines = Vector("a line of text", "another", "not written by the user", "but hardcoded into the program")
  lines.map(report).foreach(println)
}
We have an auxiliary function that forms a report of a single input.
Another function gives us a collection with some example data. In this first version of the program, that collection is a vector and contains five specific strings.
The last command uses these functions: we take the strings, generate a report from each one, and print out the reports.

Now, let’s undertake to edit that program so that it works like this, in the text console:

Enter some text: hello
The input is 5 characters long.
Enter some text: hello again
The input is 11 characters long.
Enter some text: third input
The input is 11 characters long.
Enter some text: fourth
The input is 6 characters long.
Enter some text: stop already
The input is 12 characters long.
Enter some text: stop
The input is 4 characters long.
Enter some text: please
Our program should not use any standard set of strings. It should prompt the user for input.
It should keep doing that until the user enters the word "please". Before that, the user may any number of inputs large or small.

How can we specify when the input-reading and report-printing should stop when we don’t know the number of inputs?

One way to do that is to use something called a stream. Before we apply streams to that problem, however, let’s start from some basics.

A Stream of Elements

A stream (virta) is collection of elements. The word is meant to evoke the notion that a flowing stream “brings” elements for processing.

You can create a stream by listing its elements, just like you can create other collections:

val streamOfData = Stream(10.2, 32.1, 3.14159)streamOfData: Stream[Double] = Stream(10.2, ?)
The toString on Streams, which is here used by the REPL, doesn’t list the entire contents of the stream. This is a symptom of how streams work; we’ll say more about that in a bit.

Another approach is to use toStream, which creates a stream from an existing collection (like toVector, toBuffer, etc.; Chapter 4.2):

val vectorOfWords = Vector("first", "second", "third", "fourth")vectorOfWords: Vector[String] = Vector(first, second, third, fourth)
val streamOfWords = vectorOfWords.toStreamstreamOfWords: Stream[String] = Stream(first, ?)
This gives us a stream that contains the same elements as the vector.

Streams have the collection methods you know. As an example, the following command skips past a couple of the first elements in the stream, then picks out the first of the remaining ones:

streamOfWords.drop(2).headres9: String = third

A loop works, too:

for (word <- streamOfWords) {
  println("data from the stream: " + word)
}data from the stream: first
data from the stream: second
data from the stream: third
data from the stream: fourth

As do higher-order methods:

streamOfWords.filter( _.length > 5 ).map( _ + "!" ).foreach(println)second!
fourth!

Nothing groundbreaking there. How are streams different?

Infinite Streams

One special thing about streams is that their size doesn’t need to be finite. You can define an endless stream of values. (Cf. infinite sequences in math.)

The factory method continually creates an infinite stream:

val myStream = Stream.continually("SPAM")myStream: Stream[String] = Stream(SPAM, ?)

We have just created a collection that has the string "SPAM" as each of its elements and that contains an infinite number of these identical elements. Let’s take the first five and print them out:

myStream.take(5).foreach(println)SPAM
SPAM
SPAM
SPAM
SPAM
The original stream is infinite, but take returns a finite stream of the given length. That stream contains some of the elements from the infinite stream.
foreach doesn’t have an infinite amount of work to do since we apply it to only the five-element stream returned by take.

Consider another example of an infinite stream. (It features the ++ operator from Chapter 4.2, which combines two collections.)

val words = Vector("first", "second", "third", "fourth")words: Vector[String] = Vector(first, second, third, fourth)
def wordStream = words.toStream ++ Stream.continually("OUT OF WORDS")wordStream: Stream[String]
wordStream.take(7).foreach(println)first
second
third
fourth
OUT OF WORDS
OUT OF WORDS
OUT OF WORDS
We form a stream that has the four elements for starters, followed a specific string that repeats an arbitrary number of times.
If we take the first seven elements, we get the four different ones and three copies of the repeating message.
If we hadn’t used toStream here, our code wouldn’t work. Why? What would happen? You can try it in the REPL, but if you have any unsaved work, save it first.

In the examples above, the infinite part of each stream was made up of identical elements. In the example below, that isn’t the case. Let’s form a stream of pseudorandom numbers.

val randomStream = Stream.continually( Random.nextInt(10) )randomStream: Stream[Int] = Stream(8, ?)

Now let’s take a few numbers:

randomStream.take(5).foreach(println)8
9
5
6
8

And here we generate random numbers between 0 and 99 until we happen to hit one that exceeds 90:

Stream.continually( Random.nextInt(100) ).takeWhile( _ <= 90 ).mkString(",")res10: String = 31,84,16,45,72,81,41,36,87,19,79,62,13,60,47,45,66,58,85,15,8,9,7,30,68,41,48,80,21,78,72,27
Stream.continually( Random.nextInt(100) ).takeWhile( _ <= 90 ).mkString(",")res11: String = 0,65,83,38,75,33,11,18,75,51,3
takeWhile returns a stream that terminates just before the first element that fails to match the given criterion. Even though our original stream is infinite, the resulting stream is not. However, like continually and take, takeWhile does not yet generate all the numbers; it merely returns a stream that is capable of generating them until a terminating element is reached.
mkString produces a string that describes all the elements in the stream. Had we invoked it on an infinite stream, we would crash the program by filling the available memory. On a finite stream, however, this works. In this example, the streams’ contents are random and we get streams of different lengths when we run the command multiple times.

How Streams Work

Obviously, a computer cannot store an infinite number of distinct elements, which is where the key feature of streams comes in: the whole stream isn’t constructed as soon as you create a stream object. Instead, the stream generates new elements only as needed. For instance, in order to run mkString above, it was necessary to generate some random numbers by repeatedly evaluating Random.nextInt(100). Those parts of the stream were actually generated in memory, but only when mkString needed them and only to the extent necessary.

Earlier in this chapter, we discussed by-name parameters, which are evaluated by the called function as needed. It is precisely such a by-name parameter that we passed to continually; the method forms a collection that keeps evaluating the parameter whenever we access the stream for a new element.

This principle of generating elements only when needed will be crucial as we use streams for handling input, next.

Back to Our Input-Processing Program

Earlier, we came up with this program that reports the lengths of four specific strings.

object LengthReport extends App {
  def report(input: String) = "The input is " + input.length + " characters long."
  def lines = Vector("a line of text", "another", "not written by the user", "but hardcoded into the program")
  lines.map(report).foreach(println)
}

What we wanted instead is a program that reports on the lengths of an arbitrary number of inputs and terminates with "please". With streams, this is easy to accomplish. Here’s an almost working implementation:

object SayPlease extends App {
  def report(input: String) = "The input is " + input.length + " characters long."
  def inputs = Stream.continually( readLine("Enter some text: ") )
  inputs.map(report).foreach(println)
}
The only difference to the previous version is: we don’t return a vector of predefined strings but a stream that “brings” inputs for the program to process. This is an infinite stream of strings that are generated by asking the user to provide them. This command alone does not yet prompt the user for all those inputs! It only defines a stream whose elements come from calling readLine whenever we need a new element.
map generates a “stream of reports” whose each element is formed (as needed) by calling readLine and applying report to the resulting string. This command also doesn’t prompt the user for the inputs, nor does it call report on them. It simply prepares the stream to do so if and when we later access the elements.
We use foreach to print out the elements of the report stream. In order to print an element, it’s necessary to determine what that element is by prompting the user for input and applying report. In practice, what we get is a program that repeatedly receives keyboard input and reports its length.

That pretty much works. But we didn’t attend to the magic word yet: the above program just keeps prompting for new inputs till the cows come home. This one doesn’t:

object SayPlease extends App {
  def report(input: String) = "The input is " + input.length + " characters long."
  def inputs = Stream.continually( readLine("Enter some text: ") )
  inputs.takeWhile( _ != "please" ).map(report).foreach(println)
}
takeWhile stems the stream at "please". No more elements (user inputs) will be generated once that element is reached. (Cf. the earlier example of takeWhile on a random stream.)

Practice on Streams

There’s a piece missing from this REPL example:

import scala.util.Randomimport scala.util.Random
val sumOfSixDieRolls = Stream.continually( Random.nextInt(6) ).???.sumsumOfSixDieRolls: Int = 19

What can you write in place of the question marks to make the code sum up six numbers, each between 1 and 6 (inclusive)?

import scala.io.StdIn._

object Program extends App {
  def streamOfInputs = Stream.continually( readLine("Write something: ") )
  val result = streamOfInputs.dropWhile( _.length < 5 ).head
  println(result)
}

What does that program print out?

import scala.io.StdIn._

object ExampleApp1 extends App {
  val userInputs = Stream.continually( readLine("Enter a number: ") )
  println(userInputs.take(4).mkString(","))
  println(userInputs.take(4).map( _.toInt ).product)
}

What does that program do? (Let’s assume the user types in integer numbers and doesn’t interrupt the program.)

import scala.io.StdIn._

object ExampleApp2 extends App {
  def userInputs = Stream.continually( readLine("Enter a number: ") )
  println(userInputs.take(4).mkString(","))
  println(userInputs.take(4).map( _.toInt ).product)
}
This program is identical to the previous one except that userInputs is defined as a function.
This means that accessing userInputs twice gives us two separate Stream objects.

Which of the following correctly describes what this program does?

Assignment: Sentiments about Movies

Let’s create a program that reads in the user’s short textual comments about movies. The program then tries to figure out whether each of those user inputs is positive or negative in tone. That estimate will be based on thousands of earlier movie reviews that have been written by real people, manually marked as positive or negative by a human, and made accessible to the program.

The more difficult and laborsome parts of this program have already been written. You’ll only need to put the user interface in order.

Task description

Your task is to fetch Sentiments and complete its user interface in MovieSentimentApp.scala. The UI should work in the text console as shown below.

Please comment on a movie or hit Enter to quit: This is a masterpiece, a truly fantastic work of cinema art.
I think this sentiment is positive. (Average word sentiment: 0.36.)

Please comment on a movie or hit Enter to quit: I hated it.
I think this sentiment is negative. (Average word sentiment: -0.41.)

Please comment on a movie or hit Enter to quit: The plot had holes in it.
I think this sentiment is negative. (Average word sentiment: -0.28.)

Please comment on a movie or hit Enter to quit: Adam Sandler
I think this sentiment is negative. (Average word sentiment: -0.80.)

Please comment on a movie or hit Enter to quit: It was great.
I think this sentiment is positive. (Average word sentiment: 0.04.)

Please comment on a movie or hit Enter to quit: It wasn't great.
I think this sentiment is negative. (Average word sentiment: -0.16.)

Please comment on a movie or hit Enter to quit: It was "great".
I think this sentiment is positive. (Average word sentiment: 0.04.)

Please comment on a movie, or hit Enter to quit: 
Bye.
The program doesn’t even try for a finespun analysis: it just lumps the user’s inputs in two categories: positive and negative.
The program also reports a number that it used in making each decision. A positive number means that, on average, the individual words of the input tend to occur in positive movie reviews, judging by the genuine data the program has been trained with.
If the input is blunt enough, the program classifies it quite accurately. It doesn’t get punctuation, though, or many of the other subtleties of human communication.
The user stops the program with an empty input. (The input must be genuinely empty with no characters, not even spaces.)

The training data tells our program which words occur in positive reviews and which words occur in negative ones. You can find that data in sample_reviews_from_rotten_tomatoes.txt within the Sentiments project. There’s no need for you to touch that data, but do take a look; it will help you understand how the program works.

The analyzer’s programmatic implementation is in SentimentAnalyzer.scala. Go ahead and browse the Scaladocs and the code if you want (but it's not required).

Implement the user interface in MovieSentimentApp.scala in package o1.sentiment.ui. The parts of the puzzle are there already, but you’ll need to use a stream and its methods to piece them together.

Instructions and hints

  • The UI should work much like our earlier example that reports the lengths of the inputs.
  • In fact, there is rather little the you need to do. You can solve the entire assignment with a couple of lines of code, or even just one.
  • The given analyzer is a simple example of automatic sentiment analysis. It may also be considered a very primitive example of machine learning, depending on how exactly you define that term. You’ll find some optional material on these topics below.

Submission form

A+ presents the exercise submission form here.

Sentiment analysis

Our example program categorizes emotions with an algorithm that is so simple that it’s even a bit surprising how often it’s accurate. The program reads in the training data, counts how often each different word appears in positive and negative reviews, and assigns each of those words a “positivity rating”. When the user types in something, the program finds all the known words in the input and computes the average of those words’ positivity ratings.

That algorithm is implemented in class SentimentAnalyzer; take a look if you want. However, the code will be easier to understand once we’ve reached Week 9; we’ll remind you then.

People have come up with considerably more sophisticated and accurate algorithms for analyzing sentiments. In fact, this assignment is based on a programming competition on the Kaggle web site.

Automatic sentiment analysis has diverse applications: it can improve the product recommendations that web sites give to potential buyers, for instance, and help humans in various diagnosis tasks. Sentiment analysis may be based on texts, as in our example, but can also take other forms; for more information, see the relevant article on Wikipedia.

Machine learning

Machine learning (koneoppiminen) is a white-hot subfield of computer science. In recent years, machine learning has grown into a constant topic of conversation in universities, the business world, and society at large. We hear about machines that learn to drive cars, predict the weather and the stock market, filter out spam, play board games, recognize objects and faces in images, recognize emotions from the tiniest changes of skin color, translate between languages, diagnose illnesses and security threats, evaluate insurance claims, sort garbage, and a seemingly endless number of other things.

Our sentiment-analysis program can be interpreted as a primitive example of machine learning, at least if we pick a suitably wide definition of this somewhat contested term.

At least, our program has several characteristics that are typical of machine learning:

  • The program receives training data, which it uses to “learn” about a particular phenomenon that the program specializes in. In this case, the training data consists of a file of annotated movie reviews, from which the program learns how positive and negative sentiments tend to be expressed. In other cases, the training data might consist of other kinds of text, images, numbers, etc.
  • The program uses the training data to determine parameters for its core algorithm: the data defines each word’s “positivity rating”. The key point is that the program code itself doesn’t say anything about how positive or negative a word might be; it’s all in the data.
  • Using what it learned from the data, the program is capable of dealing with novel situations as long as they are sufficiently similar to the training data. More specifically, this program is capable of analyzing new movie reviews.

One aspect of learning is conspicuously absent, however: we haven’t systematically assessed how accurate our program’s analyses are or used such assessments to improve the algorithm, let alone having the program assess and improve itself automatically. More sophisticated applications of machine learning commonly collect more data as they run and adjust their behavior accordingly. (A fun example is Google’s Quick, Draw! game.)

Machine learning features in many courses at Aalto. Programming 2 gives a brief introduction, and several later courses specifically deal with this topic, such as Machine Learning: Basic Principles. Moreover, there are several research groups at Aalto that apply machine learning and develop new machine-learning algorithms.

Was that artificial intelligence?

You decide.

Artificial intelligence or AI (tekoäly; keinoäly) is another fashionable but ambiguous term.

In the traditional sense, AI refers to the endeavor of artificially creating human-like intelligence: the attempt to model how humans think and to reproduce those processes computationally. These days, the term means a variety of things.

In casual conversation and marketing speech, AI might refer to just about any application that does something fancy enough to be called intelligent. Such an application perhaps makes people think that “that’s usually done by people, not machines” or “I didn’t think that machines can do that already”. In this everyday sense, our sentiment analyzer may count as AI in case its behavior is suitably impressive.

In the current news media, AI often refers to novel applications of machine learning, such as those listed above. These applications generally have little to do with human-like cognition; that isn’t their purpose at all. Quite the opposite, in fact: these programs construct mathematical models from input data and make use of the strengths that computers have over humans: speed, precision, and tirelessness.

When something like a computer winning at Chess or IBM’s Watson answering questions is marketed as AI, some pioneers of traditional AI get angry. (Claims about the annihilation of humankind by AI-gone-rogue can make them very angry.) Nevertheless, the word’s meaning appears to have permanently loosened from what it was. Perhaps AI will one day solve this terminological schism?

Aalto’s course CS-E4800 Artificial Intelligence introduces some forms of AI. The University of Helsinki offers a brief introductory course on AI as a MOOC.

Narrow, general, weak, and strong AI

Many experts are careful to draw a distiction between narrow and general AI.

Narrow AI refers to systems with “intelligence” that is limited to a particular specialization. Nearly all of the much-publicized applications of machine learning fall under narrow AI.

If we teach it to recognize cars [in images], all it does is recognize cars. There is no chance that it ends up doing something else.

—Timo Aila, researcher of machine learning (translated from an article in Finnish)

General AI is a still much loftier goal: intelligence that isn’t constrained to a specific domain but is capable of broad comprehension, learning about different domains and combining its knowledge of them. A general AI system could conceivably be programmed to develop narrow AI subsystems that deal with specific domains.

You’ll also hear the terms weak and strong AI, which mean roughly the same as narrow and general AI. This second set of terms sometimes carries the additional connotation that a strong AI system — unlike a weak one — is conscious on some level and might experience feelings.

Assignment: Rainfall

This assignment provides a bit more practice on reading in a stream of keyboard input.

Task description

In Chapter 6.4, you wrote averageRainfall. That function computes the average of a vector of rainfall measurements (Ints), stopping at 999999, which marks the end of the first data series. Any negative numbers in the input vector are ignored.

Now implement the same functionality as an interactive program. Place the program in project HigherOrder, in RainfallApp.scala

The program should work precisely as illustrated in the examples below.

Enter rainfall (or 999999 to stop): 10
Enter rainfall (or 999999 to stop): 5
Enter rainfall (or 999999 to stop): 100
Enter rainfall (or 999999 to stop): 10
Enter rainfall (or 999999 to stop): 5
Enter rainfall (or 999999 to stop): -100
Enter rainfall (or 999999 to stop): -100
Enter rainfall (or 999999 to stop): 110
Enter rainfall (or 999999 to stop): 999999
The average is 40.
Enter rainfall (or 999999 to stop): 999999
No valid data. Cannot compute average.
Enter rainfall (or 999999 to stop): -123
Enter rainfall (or 999999 to stop): -1
Enter rainfall (or 999999 to stop): 999999
No valid data. Cannot compute average.

Instructions and hints

  • In this assignment, you do not have to trouble yourself with what happens if the user enters invalid values such as the empty string "", "llama", or "3.141". You can assume that the entered text can be successfully interpreted as an integer by the toInt method (Chapter 5.2).
  • You could solve the assignment by reading the inputs with a stream, copying them into a vector (toVector), and then computing the average as you did in the earlier rainfall assignment. On the other hand, you don’t need to use a vector at all, since a stream also has the necessary methods.
    • For instance, it’s fine to call size and sum on a stream, as long as it’s finite.
    • If you call those methods on a stream, use a val that refers to the stream. (Be careful not to accidentally create multiple streams that read input from the keyboard.)
  • Make sure to spell the program’s output exactly right so that the automatic grader will give you the points you deserve.

Submission form

A+ presents the exercise submission form here.

Streams of Numbers, Conveniently

Is it possible to create an infinite vector with all the natural numbers, or something like that?

A vector (or a buffer or any other strictly evaluated collection) won’t do, because all of the vector’s elements are stored in memory and an infinite vector would need an infinite amount of storage.

But a stream will do:

def positiveNumbers = Stream.from(1)positiveNumbers: Stream[Int]
positiveNumbers.take(3).foreach(println)1
2
3
As it happens, the Scala API gives us from, a helpful method that creates a stream of increasing numbers. We pass in the first number.

Here we increment by ten at each element:

def evenTens = Stream.from(0, 10)evenTens: Stream[Int]
evenTens.take(3).foreach(println)0
10
20

And here we increment by -1, which gives us a stream of decreasing numbers:

def negativeNumbers = Stream.from(-1, -1)negativeNumbers: Stream[Int]
negativeNumbers.take(3).foreach(println)-1
-2
-3

Of course, from also works in combination with other methods:

val firstBigSquare = Stream.from(0).map( n => n * n ).dropWhile( _ <= 1234567 ).headfirstBigSquare: Int = 1236544
val words = Vector("first", "second", "third")words: Vector[String] = Vector(first, second, third)
val myStream = Stream.from(0).map( n => words(n % words.size) )myStream: Stream[String] = Stream(first, ?)

myStream refers to a stream. Which of the following options correctly describes that stream?

Optional Reading: Different Kinds of Streams

A stream that iterates

iterate is another method worth mentioning. It creates a stream that generates each element by re-applying a function to the previous element:

def alternating = Stream.iterate(1)( x => -2 * x )alternating: Stream[Int]
alternating.take(4).foreach(println)1
-2
4
-8
def babble = Stream.iterate("")( "blah" + _ )babble: Stream[String]
babble.take(4).foreach(println)
blah
blahblah
blahblahblah

The example below uses iterate to generate a stream that implements an algorithm that estimates the square root of a given number without resorting to the sqrt library function. (The program applies Newton’s method for approximating square roots of positive numbers.)

def squareRoot(n: Double) = {
  def isTooFar(approx: Double) = (approx * approx - n).abs > 0.0001
  def nextApprox(prev: Double) = (prev + n / prev) / 2
  def streamOfApproximations   = Stream.iterate(1.0)(nextApprox)
  streamOfApproximations.dropWhile(isTooFar).head
}squareRoot: (n: Double)Double
squareRoot(9)res12: Double = 3.000000001396984
squareRoot(654321)res13: Double = 808.9011064400888
The first of three auxiliary functions checks whether a given approximation isn’t close enough to the desired solution.
The second function uses an earlier approximation to generate another, as per Newton’s method.
The third generates an infinite stream of approximate solutions. The stream starts with a “guess” of 1.0 and generates better approximations by applying nextApprox repeatedly.
We get a good-enough approximation by dropping all those approximations at the front of the stream that aren’t close enough. Then we pick out the first element that remains.

A recursively defined stream

Methods such as continually, from, and iterate work well for many needs. But what if we didn’t have these factory methods? Or what if you want to define a stream that doesn’t match what those methods do?

You can use recursion to tailor exactly the sort of stream that you need.

Below is a very simple example of a recursive (i.e., self-referential) definition of a stream. This definition produces a stream of positive integers (just like Stream.from(1)).

def positiveNumbers(first: Int): Stream[Int] = first #:: positiveNumbers(first + 1)positiveNumbers: (first: Int)Stream[Int]
positiveNumbers(1).take(3).foreach(println)1
2
3
The operator #:: combines a single value and a stream, yielding another stream. The value to the left of the operator becomes the first element; it’s followed by the elements of the stream on the right-hand side.
The definition is recursive: it refers to itself. We form a sequence of positive integers by starting at the given integer and following it with a stream of positive integers that starts at the next integer.

In the Scala API, methods such as from and continually have recursive implementations. The same basic principle works for defining any kind of stream; experiment!

Recursion is a versatile and powerful technique; streams are just one of its countless applications. Chapter 12.1 will say more.

A code-reading challenge

Here’s a more elaborate example of streams. Can you figure out what this function does and how it does it?

The names of the function and its components are vague but not deliberately misleading.

This is a math-y sort of program.

def mystery(limit: Int): Vector[Int] = {
  import scala.math.sqrt

  val odd = Stream.from(3, 2)
  def candidates = odd.takeWhile( _ <= limit )

  def initialVals = odd.takeWhile( _ <= sqrt(limit).toInt )
  def multiples(n: Int) = Stream.from(n * n, 2 * n).takeWhile( _ <= limit )
  def rejected = initialVals.flatMap(multiples)

  val result = candidates.diff(rejected)
  result.toVector
}

Optional Reading: Properties of Streams

Streams are non-strict and lazy

We’ve already established that all the elements of a stream aren’t necessarily generated at all; in other words, a stream is a non-strict collection. We also noted that a stream element is generated only when that element is actually needed for some purpose.

Let’s return to this example:

val randomStream = Stream.continually( Random.nextInt(100) )randomStream: Stream[Int] = Stream(12, ?)
randomStream.take(5).mkString(",")res14: String = 12,62,28,14,31
The stream evaluates the element-generating expression Random.nextInt(100) only when or if a new element is needed. In this example, that happens five times, since mkString forces the stream to access five of its elements. In order to do that, the stream needs to evaluate the randomizing expression repeatedly.

Careful now: an element is generated into the stream whenever a new element is needed. What if you reuse a previously generated element?

randomStream.take(5).mkString(",")res15: String = 12,62,28,14,31
randomStream.take(5).mkString(",")res16: String = 12,62,28,14,31
randomStream.take(10).mkString(",")res17: String = 12,62,28,14,31,27,79,18,78,43
Looking again at the first five elements of our stream, we get the same numbers as before. The stream object has stored those elements and doesn’t re-evaluate the randomizing expression again when we repeat our request.
When we take a slightly longer chunk of the stream and check its contents, we get the five stored elements followed by some additional ones that are randomly generated (and stored) in order to respond to our request.

In programmer-speak, a stream is not merely non-strict but lazy (laiska): it not only doesn’t generate elements unless it needs them but also stores the elements to save itself the future trouble of having to generate them again.

Laziness is even more conspicuous in the next example.

Let’s start by creating a function that generates new elements and notifies us about it:

var counter = 0counter: Int = 0
def generateElement() = {
  val newElem = counter
  println("I guess I have to bother to generate a new element: " + newElem)
  counter += 1
  newElem
}generateElement: ()Int

Now we form a stream that uses the above function to generate its elements:

val streamOfInts = Stream.continually( generateElement() )I guess I have to bother to generate a new element: 0
streamOfInts: Stream[Int] = Stream(0, ?)
The first element is immediately generated. The REPL reports it as part of the toString output. (N.B. In future versions of the Scala API, this behavior will not occur; a stream will not generate even the first element like this.)

Examining the first five elements necessitates additional generateElement calls:

streamOfInts.take(5).mkString(",")I guess I have to bother to generate a new element: 1
I guess I have to bother to generate a new element: 2
I guess I have to bother to generate a new element: 3
I guess I have to bother to generate a new element: 4
res18: String = 0,1,2,3,4

Re-examining them does not, and the stream doesn’t do anything it doesn’t need to:

streamOfInts.take(5).mkString(",")res19: String = 0,1,2,3,4

Examining further elements makes the stream work:

streamOfInts.take(10).mkString(",")I guess I have to bother to generate a new element: 5
I guess I have to bother to generate a new element: 6
I guess I have to bother to generate a new element: 7
I guess I have to bother to generate a new element: 8
I guess I have to bother to generate a new element: 9
res20: String = 0,1,2,3,4,5,6,7,8,9

val, def, streams, and memory usage

As long as you have a reference to a stream stored somewhere, you don’t lose the stream to the garbage collector. This was illustrated in the examples just above, where we had variables (randomStream, streamOfInts) that referred to our streams. What if we hadn’t had them?

def randomStream = Stream.continually( Random.nextInt(100) )randomStream: Stream[Int]
randomStream.take(5).mkString(",")res21: String = 42,72,7,86,30
randomStream.take(5).mkString(",")res22: String = 2,2,64,87,54
Instead of a variable, we define a function that returns a reference to a new stream. Apart from that, this example is identical to our earlier one.
The expression randomStream calls a parameterless function. Whenever we evaluate this expression, we get an entirely new stream.
The two streams are independent of each other and have different elements.

Now that we didn’t store a reference to either of the random streams in a variable, the stream objects won’t stay in memory. The garbage collector promptly releases the memory for other use.

Let’s compare two more code fragments. Here’s the first one, in which a val refers to a stream:

val longStreamOfInputs = Stream.continually( readSomeDataFromSomewhere() )
longStreamOfInputs.map(doStuffWithDataElement).foreach(println)
longStreamOfInputs.map(doSomethingElseWithElement).foreach(println)
We store the stream reference in a variable. We intend to read a large but finite amount of data from a source (a file, the internet, the keyboard; somewhere).
When we subsequently command the stream to process its elements and print out the results, the entire stream is traversed and each of its elements generated with a function that reads a new element into memory. The variable longStreamOfInputs stores a reference to all the data.
Later, we take the previously stored inputs and process them in some other way. The latter traversal doesn’t fetch the inputs again: it uses the stored elements in the stream.

Here’s the same code but with a def instead of a val:

def longStreamOfInputs = Stream.continually( readSomeDataFromSomewhere() )
longStreamOfInputs.map(doStuffWithDataElement).foreach(println)
longStreamOfInputs.map(doSomethingElseWithElement).foreach(println)

This program doesn’t store the stream and its contents anywhere. In fact, even while the computer processes the stream one element at a time, each previously processed element becomes fair game for the garbage collector, which operates in the background. Especially if the stream is long, it can happen that the first elements have been processed and removed from memory before the last elements have even been generated.

The input data is not stored anywhere. The second traversal causes all the input to be re-read.

Whether you store the stream in a variable can thus make a big difference. Circumstances dictate whether it’s a good idea to do so:

  • If the stream is short, if there is plenty of memory available, and especially if the goal is to traverse the same elements multiple times, then it’s a good idea to use a variable.
  • If the elements are traversed only once and each one can be discarded as soon as it’s dealt with (as in our SentimentAnalyzer, for instance), then def is the better choice. This approach also has the benefit that all the elements don’t have to fit in memory at once.

When operating on small amounts of data, this difference between the alternatives is often negligible.

The internal implementation of streams is based on linking each element to the next element in the stream. This linked structure makes it possible to discard previously processed parts of the stream from memory, as described above; on the other hand, it also means that the only efficient way to process a stream is to advance linearly from the head rather than picking out elements at arbitrary indices (which is fine on buffers and vectors, for example). You’ll learn more about linked structures and efficiency in various follow-on courses, including Programming 2.

Lazy variables

What’s the difference between these two programs?

val result = "llama".length
println(result)
println(result)
def result = "llama".length
println(result)
println(result)

The answer should be clear enough at this stage. The first program, takes care of business (computes the string’s length) right away exactly once and stores the result in memory (in a val). The second, which defines a function, initially doesn’t compute the result at all and never would if you never called the function; on the other hand, it does the same job multiple times if you call the function repeatedly. In other words, the second program computes the length twice; the first program uses some additional memory to avoid recomputing it.

We’ve established that a stream takes the middle approach between those two extremes: it takes care of business (generates elements) only upon request but then stores the result and avoids recomputing it. We called this behavior “lazy”.

An individual variable can also be lazy. In Scala, you define such a variable by writing lazy val:

lazy val result = "llama".lengthresult: Int = <lazy>
println(result)5
println(result)5
Even though we assigned "llama".length" to the variable, that expression is not yet evaluated. For now, the lazy variable is linked to the unevaluated expression.
The REPL’s output of <lazy> means that this variable doesn’t have a value yet.
When we actually do something with the variable (here: we print its value), the computer needs to evaluate the assigned expression. It is only at this point that the string’s length is computed and an integer gets stored in the variable.
The second print command looks the same as the first. However, the computer does not recompute the string length at this point. It doesn’t have to bother because the length is already stored in the variable.

Here’s another example. Let’s define a function and a couple of lazy variables.

def printAndReturn(number: Int) = {
  println("Returning my parameter value of " + number)
  number
}printAndReturn: (number: Int)Int
lazy val first = printAndReturn(1)first: Int = <lazy>
lazy val second = printAndReturn(2)second: Int = <lazy>

Neither of the two variable definitions above caused printAndReturn to be invoked yet, as you can see from the fact that the println within the function didn’t trigger yet. Let’s continue:

if (first > 0) first * 10 else second * 10Returning my parameter value of 1
res23: Int = 10
if (first > 0) first * 10 else second * 10res24: Int = 10
In order to evaluate the conditional expression, the computer needs to evaluate first, so it determines this lazy variable’s value by calling printAndReturn. We get a line of output from the function.
The first branch is chosen, so the computer evaluates first * 10 next. first already has a value, which doesn’t get recomputed. Notice that we don’t get an additional line of output, as we would have if first had been a def rather than lazy val.
Even issuing the entire command again doesn’t trigger printAndReturn again, since first already has a value.
Since the else branch was never chosen, the value of second is unimportant. Indeed, the computer never even computed a value for that variable.

In some programming languages, most famously in Haskell, all variables are lazy and all expressions are evaluated non-strictly. In most programming languages, variables aren’t lazy and expression evaluation is largely strict.

In Scala, too, strict evaluation is the default, but as you have seen in this chapter, you can opt to use non-strict parameters and lazy collections and variables where you think it’s called for.

Summary of Key Points

  • A stream is a collection whose elements aren’t formed as soon as the collection is created. They are generated and stored as needed.
    • A stream is intended for traversing the elements in linear order.
    • You can process a stream’s elements one at a time without storing them all simultaneously in the computer’s memory or even knowing how many elements there are or will be.
    • Unlike the previously familiar collections, a stream can be either finite or infinite. You can use methods such as take and takeWhile to select a finite part of an infinite stream.
    • There are many ways to form a stream. For instance, you can repeat an operation that creates elements (continually), produce elements in numerical order (from), or repeatedly apply an operation to the previous element (iterate).
  • So-called by-name parameters are received by the function as unevaluated expressions. This contrasts with ordinary (by-value) parameters, which receive the previously computed values of expressions.
    • The receiving method may evaluate a by-name parameter once, more than once, or not at all, as appropriate.
    • These parameters are highly useful for some purposes, such as forming streams.
  • Links to the glossary: stream; by-name parameter; strict, non-strict, and lazy evaluation.

Soon to be LazyList

We’re using version 2.12 of Scala. From the brand-new version 2.13 onwards, the Stream class is replaced by a class named LazyList. We’ve used Stream here, but the basic idea is exactly the same for LazyList.

Feedback

Please note that this section must be completed individually. Even if you worked on this chapter with a pair, each of you should submit the form separately.

Credits

Thousands of students have given feedback that has contributed to this ebook’s design. Thank you!

Weeks 1 to 13 of the ebook, including the assignments and weekly bulletins, have been written in Finnish and translated into English by Juha Sorva.

Weeks 14 to 20 are by Otto Seppälä. That part of the ebook isn’t available during the fall term, but we’ll publish it when it’s time.

The appendices (glossary, Scala reference, FAQ, etc.) are by Juha Sorva unless otherwise specified on the page.

The automatic assessment of the assignments has been developed by: (in alphabetical order) Riku Autio, Nikolas Drosdek, Joonatan Honkamaa, Jaakko Kantojärvi, Niklas Kröger, Teemu Lehtinen, Strasdosky Otewa, Timi Seppälä, Teemu Sirkiä, and Aleksi Vartiainen.

The illustrations at the top of each chapter, and the similar drawings elsewhere in the ebook, are the work of Christina Lassheikki.

The animations that detail the execution Scala programs have been designed by Juha Sorva and Teemu Sirkiä. Teemu Sirkiä and Riku Autio have done the technical implementation, relying on Teemu’s Jsvee and Kelmu toolkits.

The other diagrams and interactive presentations in the ebook are by Juha Sorva.

The O1Library software has been developed by Aleksi Lukkarinen and Juha Sorva. Several of its key components are built upon Aleksi’s SMCL library.

The pedagogy behind O1Library’s tools for simple graphical programming (such as Pic) is inspired by the textbooks How to Design Programs by Flatt, Felleisen, Findler, and Krishnamurthi and Picturing Programs by Stephen Bloch.

The course platform A+ has been created by Aalto’s LeTech research group and is largely developed by students. The current lead developer is Jaakko Kantojärvi; many other students of computer science and information networks are also active on the project.

For O1’s current teaching staff, please see Chapter 1.1.

Additional credits for this page

The rainfall assignment is a version of a classic programming problem by Elliot Soloway.

The sentiment-analyzer assignment is an adaptation of a programming assignment by Eric D. Manley and Timothy M. Urness, which is in turn based on a programming contest on the web site Kaggle.

a drop of ink
Posting submission...