The latest instance of the course can be found at: O1: 2024
Luet oppimateriaalin englanninkielistä versiota. Mainitsit kuitenkin taustakyselyssä osaavasi suomea. Siksi suosittelemme, että käytät suomenkielistä versiota, joka on testatumpi ja hieman laajempi ja muutenkin mukava.
Suomenkielinen materiaali kyllä esittelee englanninkielisetkin termit. Myös suomenkielisessä materiaalissa käytetään ohjelmaprojektien koodissa englanninkielisiä nimiä kurssin alkupään johdantoesimerkkejä lukuunottamatta.
Voit vaihtaa kieltä A+:n valikon yläreunassa olevasta painikkeesta. Tai tästä: Vaihda suomeksi.
Chapter 7.1: Streams
About This Page
Questions Answered: How do I write a program that reads in an unspecified amount of input or repeats some other operation a initially unknown number of times? How can I operate on a bunch of data one at a time without first storing all of it in memory?
Topics: Finite and infinite streams. By-name parameters.
What Will I Do? Read and program.
Rough Estimate of Workload:? Two or three hours? This is a fairly long chapter, but much of it is optional reading.
Points Available: A210. (There isn’t all that much to do, but the points value is high because the assignments are intended as near-mandatory.)
Related Projects: Sentiments (new), HigherOrder.
Introduction
As a warm-up for the main subject of this chapter, streams, let’s consider another topic.
Suppose we intend to create a function fiveTimes
. This function should return a
five-element vector where each element is the value of a given Int
expression:
fiveTimes(100)res0: Vector[Int] = Vector(100, 100, 100, 100, 100) fiveTimes(1 + 1)res1: Vector[Int] = Vector(2, 2, 2, 2, 2)
We really want five separate evaluations of the same given expression. If different evaluations yield different results, that should be reflected in the output:
import scala.util.Randomimport scala.util.Random fiveTimes(Random.nextInt(100))res2: Vector[Int] = Vector(43, 55, 21, 46, 87) fiveTimes(Random.nextInt(100))res3: Vector[Int] = Vector(33, 65, 62, 31, 73)
Passing Unevaluated Parameters
Here’s a version of fiveTimes
that works. It’s nearly identical to our earlier
attempt:
def fiveTimes(numberExpression: =>Int) =
Vector.tabulate(5)( anyIndex => numberExpression )
As we just re-established, ordinary parameters — also known as by-value parameters — are evaluated before the function call begins. What gets passed in is a value. In contrast, a by-name parameter is not evaluated before the function call begins: what gets passed in is an unevaluated expression. That parameter expression will be evaluated only when the function body uses it (if in fact it does use it). That is, the parameter is evaluated while the function is already running.
In case a function uses a by-name parameter multiple times, the parameter gets evaluated multiple times:
def fiveTimes(numberExpression: =>Int) =
Vector.tabulate(5)( anyIndex => numberExpression )
tabulate
forms a five-element vector, placing an Int
at each
of the indices from 0 to 4. Each element is obtained by evaluating
numberExpression
. If numberExpression
is, say, Random.nextInt(100)
,
five distinct random numbers will be placed in the vector.You can think of by-name parameters as a sort of parameterless function that is passed to another function as a parameter.
Here’s an illustrated example of a by-name parameter:
In O1, you aren’t required to define methods that take by-name parameters. However, you
will need to use such methods. There’s nothing much to it, and indeed you have already
done so, since getOrElse
on Option
s takes a by-name parameter. getOrElse
has been
defined to either a) return the contents of the Option
wrapper and leave the parameter
untouched, or b) evaluate the parameter expression and return the result if there’s nothing
in the Option
. This is why the method works as discussed above.
Another familiar example
In Chapter 5.1 you saw that the subexpression to the right of the logical
operators &&
and ||
is evaluated only in case the subexpression on the
left isn’t enough to decide the value of the logical expression. You also
know that each of those operators is actually a method on Boolean
objects
(Chapter 5.2); the “subexpression to the right” is actually a by-name
parameter of that method.
More jargon
If a parameter expression is evaluated at least once, the parameter is
termed strict (tiukka). The parameter of fiveTimes
is strict; so
are all by-value parameters. If there is a possibility that a parameter
does not get evaluated at all, it’s termed non-strict (väljä);
getOrElse
’s parameter is one example.
A+ presents the exercise submission form here.
Now to this chapter’s main subject.
Challenge: An Unknown Number of Repetitions
So far in O1, we have repeated commands by gathering all the necessary data in a collection,
then working through that collection one element at a time. The number of repetitions has
equalled the size of the collection, which has been known to our programs by the time we
start traversing the collection. For example, we have often collected data in a vector,
which is wholly stored in the computer’s memory; we have then used a for
loop or a
higher-order method to do something with some or all of the vector’s elements.
In contrast, consider these common scenarios:
- We intend to read inputs from the user for the program to process until the user indicates they wish to stop. The program doesn’t know in advance how many inputs the user will enter before stopping; the user may not know, either.
- We intend to process data from a source (a file, a network, a physical sensor governed by the program, etc.), processinga single element at a time. We don’t know the number of elements before starting and there could be so many of them that we don’t know if it’s possible to store all of them at once in the computer’s memory.
- We intend to compute increasingly precise approximations of a desired quantity by repeatedly applying a particular computation; Newton’s method for finding roots is an example. We need an iterative process that applies the computation to the previous result until we reach sufficient precision. We don’t know in advance how many iterations it will take before that happens.
More generally still: we wish to repeat an operation an initially unknown number of times. In principle, there is no limit to the number of repetions; we might have a source that continuously generates more data, for instance.
A more concrete example
Let’s write a toy program that reports the lengths of four strings. There is nothing new here yet.
object LengthReport extends App {
def report(input: String) = "The input is " + input.length + " characters long."
def lines = Vector("a line of text", "another", "not written by the user", "but hardcoded into the program")
lines.map(report).foreach(println)
}
Now, let’s undertake to edit that program so that it works like this, in the text console:
Enter some text: hello The input is 5 characters long. Enter some text: hello again The input is 11 characters long. Enter some text: third input The input is 11 characters long. Enter some text: fourth The input is 6 characters long. Enter some text: stop already The input is 12 characters long. Enter some text: stop The input is 4 characters long. Enter some text: please
"please"
.
Before that, the user may any number of inputs large or small.How can we specify when the input-reading and report-printing should stop when we don’t know the number of inputs?
One way to do that is to use something called a stream. Before we apply streams to that problem, however, let’s start from some basics.
A Stream of Elements
A stream (virta) is collection of elements. The word is meant to evoke the notion that a flowing stream “brings” elements for processing.
You can create a stream by listing its elements, just like you can create other collections:
val streamOfData = Stream(10.2, 32.1, 3.14159)streamOfData: Stream[Double] = Stream(10.2, ?)
toString
on Stream
s, which is here used by the REPL,
doesn’t list the entire contents of the stream. This is a
symptom of how streams work; we’ll say more about that in a bit.Another approach is to use toStream
, which creates a stream from an existing collection
(like toVector
, toBuffer
, etc.; Chapter 4.2):
val vectorOfWords = Vector("first", "second", "third", "fourth")vectorOfWords: Vector[String] = Vector(first, second, third, fourth)
val streamOfWords = vectorOfWords.toStreamstreamOfWords: Stream[String] = Stream(first, ?)
Streams have the collection methods you know. As an example, the following command skips past a couple of the first elements in the stream, then picks out the first of the remaining ones:
streamOfWords.drop(2).headres9: String = third
A loop works, too:
for (word <- streamOfWords) { println("data from the stream: " + word) }data from the stream: first data from the stream: second data from the stream: third data from the stream: fourth
As do higher-order methods:
streamOfWords.filter( _.length > 5 ).map( _ + "!" ).foreach(println)second! fourth!
Nothing groundbreaking there. How are streams different?
Infinite Streams
One special thing about streams is that their size doesn’t need to be finite. You can define an endless stream of values. (Cf. infinite sequences in math.)
The factory method continually
creates an infinite stream:
val myStream = Stream.continually("SPAM")myStream: Stream[String] = Stream(SPAM, ?)
We have just created a collection that has the string "SPAM"
as each of its elements
and that contains an infinite number of these identical elements. Let’s take
the first
five and print them out:
myStream.take(5).foreach(println)SPAM SPAM SPAM SPAM SPAM
take
returns a finite
stream of the given length. That stream contains some of the
elements from the infinite stream.foreach
doesn’t have an infinite amount of work to do since
we apply it to only the five-element stream returned by take
.Consider another example of an infinite stream. (It features the ++
operator from Chapter 4.2,
which combines two collections.)
val words = Vector("first", "second", "third", "fourth")words: Vector[String] = Vector(first, second, third, fourth) def wordStream = words.toStream ++ Stream.continually("OUT OF WORDS")wordStream: Stream[String] wordStream.take(7).foreach(println)first second third fourth OUT OF WORDS OUT OF WORDS OUT OF WORDS
toStream
here, our code wouldn’t work. Why?
What would happen? You can try it in the REPL, but if you have
any unsaved work, save it first.In the examples above, the infinite part of each stream was made up of identical elements. In the example below, that isn’t the case. Let’s form a stream of pseudorandom numbers.
val randomStream = Stream.continually( Random.nextInt(10) )randomStream: Stream[Int] = Stream(8, ?)
Now let’s take a few numbers:
randomStream.take(5).foreach(println)8 9 5 6 8
And here we generate random numbers between 0 and 99 until we happen to hit one that exceeds 90:
Stream.continually( Random.nextInt(100) ).takeWhile( _ <= 90 ).mkString(",")res10: String = 31,84,16,45,72,81,41,36,87,19,79,62,13,60,47,45,66,58,85,15,8,9,7,30,68,41,48,80,21,78,72,27 Stream.continually( Random.nextInt(100) ).takeWhile( _ <= 90 ).mkString(",")res11: String = 0,65,83,38,75,33,11,18,75,51,3
takeWhile
returns a stream that terminates just before the
first element that fails to match the given criterion. Even
though our original stream is infinite, the resulting stream is
not. However, like continually
and take
, takeWhile
does
not yet generate all the numbers; it merely returns a stream that
is capable of generating them until a terminating element is
reached.mkString
produces a string that describes all the elements in
the stream. Had we invoked it on an infinite stream, we would crash
the program by filling the available memory. On a finite stream,
however, this works. In this example, the streams’ contents are
random and we get streams of different lengths when we run the
command multiple times.How Streams Work
Obviously, a computer cannot store an infinite number of distinct elements, which is where
the key feature of streams comes in: the whole stream isn’t constructed as soon as you create
a stream object. Instead, the stream generates new elements only as needed. For instance, in
order to run mkString
above, it was necessary to generate some random numbers by repeatedly
evaluating Random.nextInt(100)
. Those parts of the stream were actually generated in memory,
but only when mkString
needed them and only to the extent necessary.
Earlier in this chapter, we discussed by-name parameters, which are evaluated by the called
function as needed. It is precisely such a by-name parameter that we passed to continually
;
the method forms a collection that keeps evaluating the parameter whenever we access the
stream for a new element.
This principle of generating elements only when needed will be crucial as we use streams for handling input, next.
Back to Our Input-Processing Program
Earlier, we came up with this program that reports the lengths of four specific strings.
object LengthReport extends App {
def report(input: String) = "The input is " + input.length + " characters long."
def lines = Vector("a line of text", "another", "not written by the user", "but hardcoded into the program")
lines.map(report).foreach(println)
}
What we wanted instead is a program that reports on the lengths of an arbitrary number
of inputs and terminates with "please"
. With streams, this is easy to accomplish. Here’s
an almost working implementation:
object SayPlease extends App {
def report(input: String) = "The input is " + input.length + " characters long."
def inputs = Stream.continually( readLine("Enter some text: ") )
inputs.map(report).foreach(println)
}
readLine
whenever we need a new element.map
generates a “stream of reports” whose each element is
formed (as needed) by calling readLine
and applying report
to
the resulting string. This command also doesn’t prompt the user
for the inputs, nor does it call report
on them. It simply prepares
the stream to do so if and when we later access the elements.foreach
to print out the elements of the report stream.
In order to print an element, it’s necessary to determine what that
element is by prompting the user for input and applying report
.
In practice, what we get is a program that repeatedly receives
keyboard input and reports its length.That pretty much works. But we didn’t attend to the magic word yet: the above program just keeps prompting for new inputs till the cows come home. This one doesn’t:
object SayPlease extends App {
def report(input: String) = "The input is " + input.length + " characters long."
def inputs = Stream.continually( readLine("Enter some text: ") )
inputs.takeWhile( _ != "please" ).map(report).foreach(println)
}
takeWhile
stems the stream at "please"
. No more elements
(user inputs) will be generated once that element is reached.
(Cf. the earlier example of takeWhile
on a random stream.)Practice on Streams
Assignment: Sentiments about Movies
Let’s create a program that reads in the user’s short textual comments about movies. The program then tries to figure out whether each of those user inputs is positive or negative in tone. That estimate will be based on thousands of earlier movie reviews that have been written by real people, manually marked as positive or negative by a human, and made accessible to the program.
The more difficult and laborsome parts of this program have already been written. You’ll only need to put the user interface in order.
Task description
Your task is to fetch Sentiments and complete its user interface in MovieSentimentApp.scala
.
The UI should work in the text console as shown below.
Please comment on a movie or hit Enter to quit: This is a masterpiece, a truly fantastic work of cinema art. I think this sentiment is positive. (Average word sentiment: 0.36.) Please comment on a movie or hit Enter to quit: I hated it. I think this sentiment is negative. (Average word sentiment: -0.41.) Please comment on a movie or hit Enter to quit: The plot had holes in it. I think this sentiment is negative. (Average word sentiment: -0.28.) Please comment on a movie or hit Enter to quit: Adam Sandler I think this sentiment is negative. (Average word sentiment: -0.80.) Please comment on a movie or hit Enter to quit: It was great. I think this sentiment is positive. (Average word sentiment: 0.04.) Please comment on a movie or hit Enter to quit: It wasn't great. I think this sentiment is negative. (Average word sentiment: -0.16.) Please comment on a movie or hit Enter to quit: It was "great". I think this sentiment is positive. (Average word sentiment: 0.04.) Please comment on a movie, or hit Enter to quit: Bye.
The training data tells our program which words occur in positive reviews and which
words occur in negative ones. You can find that data in sample_reviews_from_rotten_tomatoes.txt
within the Sentiments project. There’s no need for you to touch that data, but do take
a look; it will help you understand how the program works.
The analyzer’s programmatic implementation is in SentimentAnalyzer.scala
. Go ahead
and browse the Scaladocs and the code if you want (but it's not required).
Implement the user interface in MovieSentimentApp.scala
in package o1.sentiment.ui
.
The parts of the puzzle are there already, but you’ll need to use a stream and its methods
to piece them together.
Instructions and hints
- The UI should work much like our earlier example that reports the lengths of the inputs.
- In fact, there is rather little the you need to do. You can solve the entire assignment with a couple of lines of code, or even just one.
- The given analyzer is a simple example of automatic sentiment analysis. It may also be considered a very primitive example of machine learning, depending on how exactly you define that term. You’ll find some optional material on these topics below.
Submission form
A+ presents the exercise submission form here.
Sentiment analysis
Our example program categorizes emotions with an algorithm that is so simple that it’s even a bit surprising how often it’s accurate. The program reads in the training data, counts how often each different word appears in positive and negative reviews, and assigns each of those words a “positivity rating”. When the user types in something, the program finds all the known words in the input and computes the average of those words’ positivity ratings.
That algorithm is implemented in class SentimentAnalyzer
; take a
look if you want. However, the code will be easier to understand
once we’ve reached Week 9; we’ll remind you then.
People have come up with considerably more sophisticated and accurate algorithms for analyzing sentiments. In fact, this assignment is based on a programming competition on the Kaggle web site.
Automatic sentiment analysis has diverse applications: it can improve the product recommendations that web sites give to potential buyers, for instance, and help humans in various diagnosis tasks. Sentiment analysis may be based on texts, as in our example, but can also take other forms; for more information, see the relevant article on Wikipedia.
Machine learning
Machine learning (koneoppiminen) is a white-hot subfield of computer science. In recent years, machine learning has grown into a constant topic of conversation in universities, the business world, and society at large. We hear about machines that learn to drive cars, predict the weather and the stock market, filter out spam, play board games, recognize objects and faces in images, recognize emotions from the tiniest changes of skin color, translate between languages, diagnose illnesses and security threats, evaluate insurance claims, sort garbage, and a seemingly endless number of other things.
Our sentiment-analysis program can be interpreted as a primitive example of machine learning, at least if we pick a suitably wide definition of this somewhat contested term.
At least, our program has several characteristics that are typical of machine learning:
- The program receives training data, which it uses to “learn” about a particular phenomenon that the program specializes in. In this case, the training data consists of a file of annotated movie reviews, from which the program learns how positive and negative sentiments tend to be expressed. In other cases, the training data might consist of other kinds of text, images, numbers, etc.
- The program uses the training data to determine parameters for its core algorithm: the data defines each word’s “positivity rating”. The key point is that the program code itself doesn’t say anything about how positive or negative a word might be; it’s all in the data.
- Using what it learned from the data, the program is capable of dealing with novel situations as long as they are sufficiently similar to the training data. More specifically, this program is capable of analyzing new movie reviews.
One aspect of learning is conspicuously absent, however: we haven’t systematically assessed how accurate our program’s analyses are or used such assessments to improve the algorithm, let alone having the program assess and improve itself automatically. More sophisticated applications of machine learning commonly collect more data as they run and adjust their behavior accordingly. (A fun example is Google’s Quick, Draw! game.)
Machine learning features in many courses at Aalto. Programming 2 gives a brief introduction, and several later courses specifically deal with this topic, such as Machine Learning: Basic Principles. Moreover, there are several research groups at Aalto that apply machine learning and develop new machine-learning algorithms.
Was that artificial intelligence?
You decide.
Artificial intelligence or AI (tekoäly; keinoäly) is another fashionable but ambiguous term.
In the traditional sense, AI refers to the endeavor of artificially creating human-like intelligence: the attempt to model how humans think and to reproduce those processes computationally. These days, the term means a variety of things.
In casual conversation and marketing speech, AI might refer to just about any application that does something fancy enough to be called intelligent. Such an application perhaps makes people think that “that’s usually done by people, not machines” or “I didn’t think that machines can do that already”. In this everyday sense, our sentiment analyzer may count as AI in case its behavior is suitably impressive.
In the current news media, AI often refers to novel applications of machine learning, such as those listed above. These applications generally have little to do with human-like cognition; that isn’t their purpose at all. Quite the opposite, in fact: these programs construct mathematical models from input data and make use of the strengths that computers have over humans: speed, precision, and tirelessness.
When something like a computer winning at Chess or IBM’s Watson answering questions is marketed as AI, some pioneers of traditional AI get angry. (Claims about the annihilation of humankind by AI-gone-rogue can make them very angry.) Nevertheless, the word’s meaning appears to have permanently loosened from what it was. Perhaps AI will one day solve this terminological schism?
Aalto’s course CS-E4800 Artificial Intelligence introduces some forms of AI. The University of Helsinki offers a brief introductory course on AI as a MOOC.
Narrow, general, weak, and strong AI
Many experts are careful to draw a distiction between narrow and general AI.
Narrow AI refers to systems with “intelligence” that is limited to a particular specialization. Nearly all of the much-publicized applications of machine learning fall under narrow AI.
If we teach it to recognize cars [in images], all it does is recognize cars. There is no chance that it ends up doing something else.
—Timo Aila, researcher of machine learning (translated from an article in Finnish)
General AI is a still much loftier goal: intelligence that isn’t constrained to a specific domain but is capable of broad comprehension, learning about different domains and combining its knowledge of them. A general AI system could conceivably be programmed to develop narrow AI subsystems that deal with specific domains.
You’ll also hear the terms weak and strong AI, which mean roughly the same as narrow and general AI. This second set of terms sometimes carries the additional connotation that a strong AI system — unlike a weak one — is conscious on some level and might experience feelings.
Assignment: Rainfall
This assignment provides a bit more practice on reading in a stream of keyboard input.
Task description
In Chapter 6.4, you wrote averageRainfall
. That function computes the average of a
vector of rainfall measurements (Int
s), stopping at 999999, which marks the end of
the first data series. Any negative numbers in the input vector are ignored.
Now implement the same functionality as an interactive program. Place the program in
project HigherOrder, in RainfallApp.scala
The program should work precisely as illustrated in the examples below.
Enter rainfall (or 999999 to stop): 10
Enter rainfall (or 999999 to stop): 5
Enter rainfall (or 999999 to stop): 100
Enter rainfall (or 999999 to stop): 10
Enter rainfall (or 999999 to stop): 5
Enter rainfall (or 999999 to stop): -100
Enter rainfall (or 999999 to stop): -100
Enter rainfall (or 999999 to stop): 110
Enter rainfall (or 999999 to stop): 999999
The average is 40.
Enter rainfall (or 999999 to stop): 999999
No valid data. Cannot compute average.
Enter rainfall (or 999999 to stop): -123
Enter rainfall (or 999999 to stop): -1
Enter rainfall (or 999999 to stop): 999999
No valid data. Cannot compute average.
Instructions and hints
- In this assignment, you do not have to trouble yourself with what
happens if the user enters invalid values such as the empty string
"", "llama", or "3.141". You can assume that the entered text can
be successfully interpreted as an integer by the
toInt
method (Chapter 5.2). - You could solve the assignment by reading the inputs with a stream,
copying them into a vector (
toVector
), and then computing the average as you did in the earlier rainfall assignment. On the other hand, you don’t need to use a vector at all, since a stream also has the necessary methods.- For instance, it’s fine to call
size
andsum
on a stream, as long as it’s finite. - If you call those methods on a stream, use a
val
that refers to the stream. (Be careful not to accidentally create multiple streams that read input from the keyboard.)
- For instance, it’s fine to call
- Make sure to spell the program’s output exactly right so that the automatic grader will give you the points you deserve.
Submission form
A+ presents the exercise submission form here.
Streams of Numbers, Conveniently
A vector (or a buffer or any other strictly evaluated collection) won’t do, because all of the vector’s elements are stored in memory and an infinite vector would need an infinite amount of storage.
But a stream will do:
def positiveNumbers = Stream.from(1)positiveNumbers: Stream[Int]
positiveNumbers.take(3).foreach(println)1
2
3
from
, a helpful method
that creates a stream of increasing numbers. We pass in the first
number.Here we increment by ten at each element:
def evenTens = Stream.from(0, 10)evenTens: Stream[Int] evenTens.take(3).foreach(println)0 10 20
And here we increment by -1, which gives us a stream of decreasing numbers:
def negativeNumbers = Stream.from(-1, -1)negativeNumbers: Stream[Int] negativeNumbers.take(3).foreach(println)-1 -2 -3
Of course, from
also works in combination with other methods:
val firstBigSquare = Stream.from(0).map( n => n * n ).dropWhile( _ <= 1234567 ).headfirstBigSquare: Int = 1236544
Optional Reading: Different Kinds of Streams
A stream that iterates
iterate
is another method worth mentioning. It creates a stream that
generates each element by re-applying a function to the previous element:
def alternating = Stream.iterate(1)( x => -2 * x )alternating: Stream[Int] alternating.take(4).foreach(println)1 -2 4 -8 def babble = Stream.iterate("")( "blah" + _ )babble: Stream[String] babble.take(4).foreach(println) blah blahblah blahblahblah
The example below uses iterate
to generate a stream that implements an
algorithm that estimates the square root of a given number without resorting
to the sqrt
library function. (The program applies Newton’s method
for approximating square roots of positive numbers.)
def squareRoot(n: Double) = { def isTooFar(approx: Double) = (approx * approx - n).abs > 0.0001 def nextApprox(prev: Double) = (prev + n / prev) / 2 def streamOfApproximations = Stream.iterate(1.0)(nextApprox) streamOfApproximations.dropWhile(isTooFar).head }squareRoot: (n: Double)Double squareRoot(9)res12: Double = 3.000000001396984 squareRoot(654321)res13: Double = 808.9011064400888
nextApprox
repeatedly.A recursively defined stream
Methods such as continually
, from
, and iterate
work well for many needs.
But what if we didn’t have these factory methods? Or what if you want to define
a stream that doesn’t match what those methods do?
You can use recursion to tailor exactly the sort of stream that you need.
Below is a very simple example of a recursive (i.e., self-referential)
definition of a stream. This definition produces a stream of positive integers
(just like Stream.from(1)
).
def positiveNumbers(first: Int): Stream[Int] = first #:: positiveNumbers(first + 1)positiveNumbers: (first: Int)Stream[Int]
positiveNumbers(1).take(3).foreach(println)1
2
3
#::
combines a single value and a stream,
yielding another stream. The value to the left of the
operator becomes the first element; it’s followed by the
elements of the stream on the right-hand side.In the Scala API, methods such as from
and continually
have recursive
implementations. The same basic principle works for defining any kind of stream;
experiment!
Recursion is a versatile and powerful technique; streams are just one of its countless applications. Chapter 12.1 will say more.
A code-reading challenge
Here’s a more elaborate example of streams. Can you figure out what this function does and how it does it?
The names of the function and its components are vague but not deliberately misleading.
This is a math-y sort of program.
def mystery(limit: Int): Vector[Int] = {
import scala.math.sqrt
val odd = Stream.from(3, 2)
def candidates = odd.takeWhile( _ <= limit )
def initialVals = odd.takeWhile( _ <= sqrt(limit).toInt )
def multiples(n: Int) = Stream.from(n * n, 2 * n).takeWhile( _ <= limit )
def rejected = initialVals.flatMap(multiples)
val result = candidates.diff(rejected)
result.toVector
}
Optional Reading: Properties of Streams
Streams are non-strict and lazy
We’ve already established that all the elements of a stream aren’t necessarily generated at all; in other words, a stream is a non-strict collection. We also noted that a stream element is generated only when that element is actually needed for some purpose.
Let’s return to this example:
val randomStream = Stream.continually( Random.nextInt(100) )randomStream: Stream[Int] = Stream(12, ?)
randomStream.take(5).mkString(",")res14: String = 12,62,28,14,31
Random.nextInt(100)
only when or if a new element is needed.
In this example, that happens five times, since mkString
forces the stream to access five of its elements. In order to
do that, the stream needs to evaluate the randomizing expression
repeatedly.Careful now: an element is generated into the stream whenever a new element is needed. What if you reuse a previously generated element?
randomStream.take(5).mkString(",")res15: String = 12,62,28,14,31 randomStream.take(5).mkString(",")res16: String = 12,62,28,14,31 randomStream.take(10).mkString(",")res17: String = 12,62,28,14,31,27,79,18,78,43
In programmer-speak, a stream is not merely non-strict but lazy (laiska): it not only doesn’t generate elements unless it needs them but also stores the elements to save itself the future trouble of having to generate them again.
Laziness is even more conspicuous in the next example.
Let’s start by creating a function that generates new elements and notifies us about it:
var counter = 0counter: Int = 0 def generateElement() = { val newElem = counter println("I guess I have to bother to generate a new element: " + newElem) counter += 1 newElem }generateElement: ()Int
Now we form a stream that uses the above function to generate its elements:
val streamOfInts = Stream.continually( generateElement() )I guess I have to bother to generate a new element: 0 streamOfInts: Stream[Int] = Stream(0, ?)
toString
output. (N.B. In future versions
of the Scala API, this behavior will not occur; a stream will
not generate even the first element like this.)Examining the first five elements necessitates additional generateElement
calls:
streamOfInts.take(5).mkString(",")I guess I have to bother to generate a new element: 1 I guess I have to bother to generate a new element: 2 I guess I have to bother to generate a new element: 3 I guess I have to bother to generate a new element: 4 res18: String = 0,1,2,3,4
Re-examining them does not, and the stream doesn’t do anything it doesn’t need to:
streamOfInts.take(5).mkString(",")res19: String = 0,1,2,3,4
Examining further elements makes the stream work:
streamOfInts.take(10).mkString(",")I guess I have to bother to generate a new element: 5 I guess I have to bother to generate a new element: 6 I guess I have to bother to generate a new element: 7 I guess I have to bother to generate a new element: 8 I guess I have to bother to generate a new element: 9 res20: String = 0,1,2,3,4,5,6,7,8,9
val
, def
, streams, and memory usage
As long as you have a reference to a stream stored somewhere, you don’t lose the stream
to the garbage collector. This was illustrated in the examples just
above, where we had variables (randomStream
, streamOfInts
) that referred to our
streams. What if we hadn’t had them?
def randomStream = Stream.continually( Random.nextInt(100) )randomStream: Stream[Int] randomStream.take(5).mkString(",")res21: String = 42,72,7,86,30 randomStream.take(5).mkString(",")res22: String = 2,2,64,87,54
randomStream
calls a parameterless function.
Whenever we evaluate this expression, we get an entirely new
stream.Now that we didn’t store a reference to either of the random streams in a variable, the stream objects won’t stay in memory. The garbage collector promptly releases the memory for other use.
Let’s compare two more code fragments. Here’s the first one, in which a val
refers
to a stream:
val longStreamOfInputs = Stream.continually( readSomeDataFromSomewhere() )
longStreamOfInputs.map(doStuffWithDataElement).foreach(println)
longStreamOfInputs.map(doSomethingElseWithElement).foreach(println)
longStreamOfInputs
stores a reference to all the data.Here’s the same code but with a def
instead of a val
:
def longStreamOfInputs = Stream.continually( readSomeDataFromSomewhere() )
longStreamOfInputs.map(doStuffWithDataElement).foreach(println)
longStreamOfInputs.map(doSomethingElseWithElement).foreach(println)
This program doesn’t store the stream and its contents anywhere. In fact, even while the computer processes the stream one element at a time, each previously processed element becomes fair game for the garbage collector, which operates in the background. Especially if the stream is long, it can happen that the first elements have been processed and removed from memory before the last elements have even been generated.
Whether you store the stream in a variable can thus make a big difference. Circumstances dictate whether it’s a good idea to do so:
- If the stream is short, if there is plenty of memory available, and especially if the goal is to traverse the same elements multiple times, then it’s a good idea to use a variable.
- If the elements are traversed only once and each one can
be discarded as soon as it’s dealt with (as in our
SentimentAnalyzer
, for instance), thendef
is the better choice. This approach also has the benefit that all the elements don’t have to fit in memory at once.
When operating on small amounts of data, this difference between the alternatives is often negligible.
The internal implementation of streams is based on linking each element to the next element in the stream. This linked structure makes it possible to discard previously processed parts of the stream from memory, as described above; on the other hand, it also means that the only efficient way to process a stream is to advance linearly from the head rather than picking out elements at arbitrary indices (which is fine on buffers and vectors, for example). You’ll learn more about linked structures and efficiency in various follow-on courses, including Programming 2.
Lazy variables
What’s the difference between these two programs?
val result = "llama".length
println(result)
println(result)
def result = "llama".length
println(result)
println(result)
The answer should be clear enough at this stage. The first program,
takes care of business (computes the string’s length) right away
exactly once and stores the result in memory (in a val
). The
second, which def
ines a function, initially doesn’t compute the
result at all and never would if you never called the function;
on the other hand, it does the same job multiple times if you
call the function repeatedly. In other words, the second program
computes the length twice; the first program uses some additional
memory to avoid recomputing it.
We’ve established that a stream takes the middle approach between those two extremes: it takes care of business (generates elements) only upon request but then stores the result and avoids recomputing it. We called this behavior “lazy”.
An individual variable can also be lazy. In Scala, you define such
a variable by writing lazy val
:
lazy val result = "llama".lengthresult: Int = <lazy> println(result)5 println(result)5
"llama".length"
to the variable,
that expression is not yet evaluated. For now, the lazy
variable is linked to the unevaluated expression.Here’s another example. Let’s define a function and a couple of lazy variables.
def printAndReturn(number: Int) = { println("Returning my parameter value of " + number) number }printAndReturn: (number: Int)Int lazy val first = printAndReturn(1)first: Int = <lazy> lazy val second = printAndReturn(2)second: Int = <lazy>
Neither of the two variable definitions above caused printAndReturn
to be invoked yet, as you can see from the fact that the println
within the function didn’t trigger yet. Let’s continue:
if (first > 0) first * 10 else second * 10Returning my parameter value of 1 res23: Int = 10 if (first > 0) first * 10 else second * 10res24: Int = 10
first
, so it determines
this lazy variable’s value by calling printAndReturn
.
We get a line of output from the function.first * 10
next. first
already has a value, which
doesn’t get recomputed. Notice that we don’t get an
additional line of output, as we would have if first
had been a def
rather than lazy val
.printAndReturn
again, since first
already has a value.else
branch was never chosen, the value
of second
is unimportant. Indeed, the computer never
even computed a value for that variable.In some programming languages, most famously in Haskell, all variables are lazy and all expressions are evaluated non-strictly. In most programming languages, variables aren’t lazy and expression evaluation is largely strict.
In Scala, too, strict evaluation is the default, but as you have seen in this chapter, you can opt to use non-strict parameters and lazy collections and variables where you think it’s called for.
Summary of Key Points
- A stream is a collection whose elements aren’t formed as soon as
the collection is created. They are generated and stored as needed.
- A stream is intended for traversing the elements in linear order.
- You can process a stream’s elements one at a time without storing them all simultaneously in the computer’s memory or even knowing how many elements there are or will be.
- Unlike the previously familiar collections,
a stream can be either finite or infinite.
You can use methods such as
take
andtakeWhile
to select a finite part of an infinite stream. - There are many ways to form a stream.
For instance, you can repeat an operation
that creates elements (
continually
), produce elements in numerical order (from
), or repeatedly apply an operation to the previous element (iterate
).
- So-called by-name parameters are received by the function as
unevaluated expressions. This contrasts with ordinary (by-value)
parameters, which receive the previously computed values of
expressions.
- The receiving method may evaluate a by-name parameter once, more than once, or not at all, as appropriate.
- These parameters are highly useful for some purposes, such as forming streams.
- Links to the glossary: stream; by-name parameter; strict, non-strict, and lazy evaluation.
Soon to be LazyList
We’re using version 2.12 of Scala. From the brand-new
version 2.13 onwards, the Stream
class is replaced
by a class named LazyList
. We’ve used Stream
here,
but the basic idea is exactly the same for LazyList
.
Feedback
Please note that this section must be completed individually. Even if you worked on this chapter with a pair, each of you should submit the form separately.
Credits
Thousands of students have given feedback that has contributed to this ebook’s design. Thank you!
Weeks 1 to 13 of the ebook, including the assignments and weekly bulletins, have been written in Finnish and translated into English by Juha Sorva.
Weeks 14 to 20 are by Otto Seppälä. That part of the ebook isn’t available during the fall term, but we’ll publish it when it’s time.
The appendices (glossary, Scala reference, FAQ, etc.) are by Juha Sorva unless otherwise specified on the page.
The automatic assessment of the assignments has been developed by: (in alphabetical order) Riku Autio, Nikolas Drosdek, Joonatan Honkamaa, Jaakko Kantojärvi, Niklas Kröger, Teemu Lehtinen, Strasdosky Otewa, Timi Seppälä, Teemu Sirkiä, and Aleksi Vartiainen.
The illustrations at the top of each chapter, and the similar drawings elsewhere in the ebook, are the work of Christina Lassheikki.
The animations that detail the execution Scala programs have been designed by Juha Sorva and Teemu Sirkiä. Teemu Sirkiä and Riku Autio have done the technical implementation, relying on Teemu’s Jsvee and Kelmu toolkits.
The other diagrams and interactive presentations in the ebook are by Juha Sorva.
The O1Library software has been developed by Aleksi Lukkarinen and Juha Sorva. Several of its key components are built upon Aleksi’s SMCL library.
The pedagogy behind O1Library’s tools for simple graphical programming (such as Pic
)
is inspired by the textbooks How to Design Programs by Flatt, Felleisen, Findler, and
Krishnamurthi and Picturing Programs by Stephen Bloch.
The course platform A+ has been created by Aalto’s LeTech research group and is largely developed by students. The current lead developer is Jaakko Kantojärvi; many other students of computer science and information networks are also active on the project.
For O1’s current teaching staff, please see Chapter 1.1.
Additional credits for this page
The rainfall assignment is a version of a classic programming problem by Elliot Soloway.
The sentiment-analyzer assignment is an adaptation of a programming assignment by Eric D. Manley and Timothy M. Urness, which is in turn based on a programming contest on the web site Kaggle.