Luet oppimateriaalin englanninkielistä versiota. Mainitsit kuitenkin taustakyselyssä osaavasi suomea. Siksi suosittelemme, että käytät suomenkielistä versiota, joka on testatumpi ja hieman laajempi ja muutenkin mukava.

Suomenkielinen materiaali kyllä esittelee englanninkielisetkin termit. Myös suomenkielisessä materiaalissa käytetään ohjelmien koodissa englanninkielisiä nimiä kurssin alkupään johdantoesimerkkejä lukuunottamatta.

Voit vaihtaa kieltä A+:n valikon yläreunassa olevasta painikkeesta. Tai tästä: Vaihda suomeksi.


Chapter 7.1: Laziness and Untold Repetitions

About This Page

Questions Answered: How do I write a program that reads in an unspecified amount of input or repeats some other operation a initially unknown number of times? How can I operate on a bunch of data one at a time without first storing all of it in memory?

Topics: Finite and infinite lazy-lists. By-name parameters.

What Will I Do? Read and program.

Rough Estimate of Workload:? Two or three hours? This is a fairly long chapter, but much of it is optional reading.

Points Available: A210 + C5. (There isn’t all that much to do, but the points value is high because the assignments are intended as near-mandatory.)

Related Modules: Sentiments (new), HigherOrder.

../_images/person04.png

Introduction

As a warm-up for the main subject of this chapter, let’s consider another topic.

Suppose we intend to create a function fiveTimes. This function should return a five-element vector where each element is the value of a given Int expression:

fiveTimes(100)res0: Vector[Int] = Vector(100, 100, 100, 100, 100)
fiveTimes(1 + 1)res1: Vector[Int] = Vector(2, 2, 2, 2, 2)

We really want five separate evaluations of the same given expression. If different evaluations yield different results, that should be reflected in the output:

import scala.util.Randomimport scala.util.Random
fiveTimes(Random.nextInt(100))res2: Vector[Int] = Vector(43, 55, 21, 46, 87)
fiveTimes(Random.nextInt(100))res3: Vector[Int] = Vector(33, 65, 62, 31, 73)

Here’s an implementation for fiveTimes. Assess it.

def fiveTimes(numberExpression: Int) =
  Vector.tabulate(5)( anyIndex => numberExpression )

Which of the following claims are correct? Select all that apply.

Chapter 4.3 introduced you to the Option class and its method getOrElse.

Some(100).getOrElse(0)res4: Int = 100
None.getOrElse(0)res5: Int = 0
vectorOfWords.lift(-123).getOrElse("no word at that index")res6: String = "no word at that index"

Above, we used a literal as getOrElse’s parameter, but we could have used a compound expression as well:

None.getOrElse(1 + 1)res7: Int = 2
vectorOfNumbers.lift(-5).getOrElse(Random.nextInt(10))res8: Int = 8

There are a couple more expressions below. Work out how they behave.

Some(123).getOrElse(1 / 0)
Some(123).getOrElse(Random.nextInt(100))

Which of the following is correct?

Passing Unevaluated Parameters

Here’s a version of fiveTimes that works. It’s nearly identical to our earlier attempt:

def fiveTimes(numberExpression: =>Int) =
  Vector.tabulate(5)( anyIndex => numberExpression )
There is a lone arrow in front of the parameter type. It means that this is an unevaluated parameter, better known as a by-name parameter.

As we just re-established, ordinary parameters — also known as by-value parameters — are evaluated before the function call begins. What gets passed in is a value. In contrast, a by-name parameter is not evaluated before the function call begins: what gets passed in is an unevaluated expression. That parameter expression will be evaluated only when the function body uses it (if in fact it does use it). That is, the parameter is evaluated while the function is already running.

In case a function uses a by-name parameter multiple times, the parameter gets evaluated multiple times:

def fiveTimes(numberExpression: =>Int) =
  Vector.tabulate(5)( anyIndex => numberExpression )
tabulate forms a five-element vector, placing an Int at each of the indices from 0 to 4. Each element is obtained by evaluating numberExpression. If numberExpression is, say, Random.nextInt(100), five distinct random numbers will be placed in the vector.

You can think of by-name parameters as a sort of parameterless function that is passed to another function as a parameter.

Here’s an illustrated example of a by-name parameter:

In O1, you aren’t required to define methods that take by-name parameters. However, you will need to use such methods. There’s nothing much to it, and indeed you have already done so, since getOrElse on Options takes a by-name parameter. getOrElse has been defined to either a) return the contents of the Option wrapper and leave the parameter untouched, or b) evaluate the parameter expression and return the result if there’s nothing in the Option. This is why the method works as discussed above.

Another familiar example

In Chapter 5.1 you saw that the subexpression to the right of the logical operators && and || is evaluated only in case the subexpression on the left isn’t enough to decide the value of the logical expression. You also know that each of those operators is actually a method on Boolean objects (Chapter 5.2); the “subexpression to the right” is actually a by-name parameter of that method.

More jargon

If a parameter expression is evaluated at least once, the parameter is termed strict (tiukka). The parameter of fiveTimes is strict; so are all by-value parameters. If there is a possibility that a parameter does not get evaluated at all, it’s termed non-strict (väljä); getOrElse’s parameter is one example.

A+ presents the exercise submission form here.

Now to this chapter’s main subject.

Challenge: An Unknown Number of Repetitions

So far in O1, we have repeated commands by gathering data in a collection, then working through that collection one element at a time. The number of repetitions has been equal to the size of the collection, which has been known to our programs by the time we start traversing the collection. For example, we have often collected data in a vector, which is wholly stored in the computer’s memory; we have then used a for loop or a higher-order method to do something with some or all of the vector’s elements.

In contrast, consider these common scenarios:

  • We intend to read inputs from the user for the program to process until the user indicates they wish to stop. The program doesn’t know in advance how many inputs the user will enter before stopping; the user may not know, either.
  • We intend to process data from a source (a file, a network, a physical sensor governed by the program, etc.), dealing with a single element at a time. We don’t know the number of elements before starting and there could be so many of them that we don’t know if it’s possible to store all of them at once in the computer’s memory.
  • We intend to compute increasingly precise approximations of a desired quantity by repeatedly applying a particular computation; Newton’s method for finding roots is an example. We need an iterative process that applies the computation to the previous result until we reach sufficient precision. We don’t know in advance how many iterations it will take before that happens.

More generally still: we wish to repeat an operation an initially unknown number of times. In principle, there is no limit to the number of repetions; we might have a source that continuously generates more data, for instance.

A more concrete example

Let’s write a toy program that reports the lengths of four strings. There is nothing new here yet.

object LengthReport extends App {
  def report(input: String) = "The input is " + input.length + " characters long."
  val lines = Vector("a line of text", "another", "not written by the user", "but hardcoded into the program")
  lines.map(report).foreach(println)
}
We have an auxiliary function that forms a report of a single input.
We also have a collection with some example data. In this first version of the program, that collection is a vector and contains five specific strings.
The last command combines those components: we take the strings, generate a report from each one, and print out the reports.

Now, let’s undertake to edit that program so that it works like this, in the text console:

Enter some text: hello
The input is 5 characters long.
Enter some text: hello again
The input is 11 characters long.
Enter some text: third input
The input is 11 characters long.
Enter some text: fourth
The input is 6 characters long.
Enter some text: stop already
The input is 12 characters long.
Enter some text: stop
The input is 4 characters long.
Enter some text: please
Our program should not use any standard set of strings. It should prompt the user for input.
It should keep doing that until the user enters the word "please". Before that, the user may any number of inputs large or small.

How can we specify when the input-reading and report-printing should stop when we don’t know the number of inputs?

One way to do that is to use something called a lazy-list. Before we apply lazy-lists to that problem, however, let’s start from some basics.

LazyLists

A lazy-list (laiskalista) is collection of elements.

You can create a lazy-list by writing down each of its elements, just like you can create other collections:

val myLazyData = LazyList(10.2, 32.1, 3.14159)myLazyData: LazyList[Double] = LazyList(<not computed>)
The toString on LazyLists, which is here used by the REPL, doesn’t show the lazy-list’s elements. This is a symptom of how lazy-lists work — or rather, how they choose not to do work. We’ll say more about that in a bit.

Another approach is to create a lazy-list from an existing collection by calling the to method (Chapter 4.2):

val vectorOfWords = Vector("first", "second", "third", "fourth")vectorOfWords: Vector[String] = Vector(first, second, third, fourth)
val wordsLazily = vectorOfWords.to(LazyList)wordsLazily: LazyList[String] = LazyList(<not computed>)
This gives us a lazy-list that contains the same elements as the vector. We don’t see those elements here, though.

Lazy-lists have the collection methods you know. As an example, the following command skips past a couple of the first elements in the lazy-list, then picks out the first of the remaining ones:

wordsLazily.drop(2).headres9: String = third

A loop works, too:

for (word <- wordsLazily) {
  println("data from the lazy-list: " + word)
}data from the lazy-list: first
data from the lazy-list: second
data from the lazy-list: third
data from the lazy-list: fourth

As do higher-order methods:

wordsLazily.filter( _.length > 5 ).map( _ + "!" ).foreach(println)second!
fourth!

Nothing groundbreaking there. How are lazy-lists different?

Infinite Lists

A special thing about lazy-lists is that their size doesn’t need to be finite. You can define an endless lazy-list of values. (Cf. infinite sequences in math.)

The factory method continually creates an infinite LazyList:

val myWords = LazyList.continually("SPAM")myWords: LazyList[String] = LazyList(<not computed>)

We have just created a collection that has the string "SPAM" as each of its elements and that contains an infinite number of these identical elements. Let’s take the first five and print them out:

myWords.take(5).foreach(println)SPAM
SPAM
SPAM
SPAM
SPAM
The original lazy-list is infinite, but take returns a finite lazy-list of the given length. That finite collection contains some of the elements from the infinite one.
foreach doesn’t have an infinite amount of work to do since we apply it to only the five-element lazy-list returned by take.

Consider another example of an infinite lazy-list. (The example features the ++ operator from Chapter 4.2, which combines two collections.)

val lyrics = LazyList("Today I don't feel like doing anything", "I just wanna lay in my bed",
                      "Don't feel like picking up my phone", "So leave a message at the tone")lyrics: LazyList[String] = LazyList(<not computed>)
val combination = lyrics ++ LazyList.continually("NO MORE LYRICS")combination: LazyList[String] = LazyList(<not computed>)
combination.take(7).foreach(println)Today I don't feel like doing anything
I just wanna lay in my bed
Don't feel like picking up my phone
So leave a message at the tone
NO MORE LYRICS
NO MORE LYRICS
NO MORE LYRICS
We form a lazy-list that starts with four specific string elements, followed another string that repeats an arbitrary number of times.
If we take the first seven elements, we get the four different ones and three copies of the repeating element.

How do LazyLists Behave?

In the examples above, the infinite part of each collection was made up of identical elements. In the example below, that isn’t the case. Let’s form a lazy-list of pseudorandom numbers.

val randomInts = LazyList.continually( Random.nextInt(10) )randomInts: LazyList[Int] = LazyList(<not computed>)

Note that the above command did not generate a single random number yet. It only defined a collection whose elements are determined by generating random numbers when needed.

Let’s take a few numbers:

myNumbers.take(5).foreach(println)8
9
5
6
8

We told the computer to print the elements of a lazy-list that holds the first five elements of our infinite list. In order to print those elements, the lazy-list needs to compute them. In this case, the elements are random numbers.

Let’s see how our lazy-list looks now:

myNumbersres10: LazyList[Int] = LazyList(8, 9, 5, 6, 8, <not computed>)

The lazy-list has computed the first five elements, having been forced to do so by the print command. Those elements are now stored as part of the LazyList that myNumbers points to. Apart from those five, the other elements haven’t been computed since there has been no need for that.

If we again take at most five elements from the front of the lazy-list, we get the same random numbers that were previously generated and stored:

myNumbers.take(3).foreach(println)8
9
5
myNumbers.take(5).foreach(println)8
9
5
6
8

The collection has stored those elements and doesn’t re-evaluate the randomizing expression when we repeat our request.

If we take a slightly longer sequence and check its contents, the lazy-list comes up with more elements as needed. In this case, that means generating (and storing) additional random numbers to follow the previously generated ones:

myNumbers.take(7).foreach(println)8
9
5
6
8
4
1
myNumbersres11: LazyList[Int] = LazyList(8, 9, 5, 6, 8, 4, 1, <not computed>)

What is laziness?

Obviously, a computer cannot store an infinite number of distinct elements, which is where lazy-lists’ main feature comes in: the whole list isn’t constructed as soon as you create a LazyList object. Instead, the lazy-list computes new elements only as needed.

For instance, in order to run foreach(println) above, it was necessary to generate some random numbers by repeatedly evaluating Random.nextInt(10). Those parts of the lazy-list were actually stored in memory, but only when actually needed and only to the extent necessary.

Earlier in this chapter, we discussed by-name parameters, which the called function evaluates as needed. It is precisely such a by-name parameter that we passed to continually. The continually method forms a lazy-list that keeps evaluating that by-name parameter whenever we access the lazy-list for a new element.

A collection that computes its elements only as needed and stores them only from that point onwards is called lazy (laiska).

Each of the questions below assumes that (only) the following commands have been previously entered:

import scala.util.Randomimport scala.util.Random
def myInts(limit: Int) = LazyList.continually( Random.nextInt(limit) )myInts: (limit: Int)LazyList[Int]

That is, we have a myInts function that returns a new LazyList of random numbers.

You can try the given code in the REPL. If you do, notice that you can press the red Stop button on the left to terminate the REPL session if necessary.

println(myInts(1000))

What does the above code do?

println(myInts(1000).take(6))

What does the above code do?

println(myInts(1000).take(6).mkString(" "))

What does the above code do?

println(myInts(1000).mkString(" "))

What does the above code do?

for (number <- myInts(1000).take(6)) {
  println(number)
}

What does the above code do?

for (number <- myInts(1000)) {
  println(number)
}

What does the above code do?

println(myInts(1000).take(6).sum)

What does the above code do?

println(myInts(1000).size)

What does the above code do?

val myList = myInts(1000)
myList.take(6).foreach(println)
println(myList.size)

What does the above code do?

Collections of unknown size

It’s possible — and useful — to create collections whose size is determined only while using the collection.

Let’s again generate some random integers between, say, 0 and 99. In this example, we won’t generate a predetermined number of integers; instead, we’ll create them and print them out until we happen to hit an integer that exceeds 90:

LazyList.continually( Random.nextInt(100) ).takeWhile( _ <= 90 ).mkString(",")res12: String = 31,84,16,45,72,81,41,36,87,19,79,62,13,60,47,45,66,58,85,15,8,9,7,30,68,41,48,80,21,78,72,27
LazyList.continually( Random.nextInt(100) ).takeWhile( _ <= 90 ).mkString(",")res13: String = 0,65,83,38,75,33,11,18,75,51,3

takeWhile returns a lazy-list that terminates just before the first element that fails to match the given criterion. Even though our original list is infinite, the resulting list is not. Like continually and take, takeWhile does not generate any numbers; it merely returns a LazyList object that is capable of yet generating numbers until a terminating element is reached.

mkString produces a string that describes all the elements in the collection, which means that each element must be computed. On a finite lazy-list, that works. In this example, the lazy-lists’ contents are random and we get lists of different lengths when we run the command multiple times.

Exploring laziness

The example below makes laziness even more conspicuous.

Let’s start by creating a helper function that generates new elements and notifies us about it:

var counter = 0counter: Int = 0
def generateElement() = {
  counter += 1
  println("I guess I have to bother to generate a new element: " + counter)
  counter
}generateElement: ()Int

Now we form a lazy-list that uses the above function to generate its elements. Not that it "bothers" to initially call the function even once, since it doesn’t need to.

val lazyInts = LazyList.continually( generateElement() )lazyInts: LazyList[Int] = LazyList(<not computed>)

Examining the first three elements forces the lazy-list to call generateElement, as evidenced by the printout:

lazyInts.take(3).mkString(",")I guess I have to bother to generate a new element: 1
I guess I have to bother to generate a new element: 2
I guess I have to bother to generate a new element: 3
res14: String = 1,2,3

Re-examining them does not, and the lazy-list doesn’t do anything it doesn’t need to:

lazyInts.take(3).mkString(",")res15: String = 1,2,3

Examining further elements forces the lazy-list to do additional work:

lazyInts.take(5).mkString(",")I guess I have to bother to generate a new element: 4
I guess I have to bother to generate a new element: 5
res16: String = 1,2,3,4,5

Terminology: LazyLists are non-strict and lazy

We’ve already established that a lazy-list computes elements only when necessary. And if an element isn’t needed at all, it is never computed. In other words, a lazy-list is a non-strict collection in the same sense as a by-name parameter is a non-strict variable.

We also used the word lazy to describe collections. This means that:

  • the collection’s elements are evaluated non-strictly; and
  • the collection stores the elements that have already been evaluated so that there is no need to evaluate them repeatedly.

For instance, a lazy-list of random integers both avoids generating additional integers and also stores the generated integers to save itself the future trouble of having to generate them again.

Back to Our Input-Processing Program

We are now going to use lazy-lists for handling user input. As we do so, the principle of generating elements lazily when needed will be crucial.

Earlier, we came up with this program that reports the lengths of four specific strings.

object LengthReport extends App {
  def report(input: String) = "The input is " + input.length + " characters long."
  val lines = Vector("a line of text", "another", "not written by the user", "but hardcoded into the program")
  lines.map(report).foreach(println)
}

What we wanted instead is a program that reports on the lengths of an arbitrary number of inputs and terminates with "please". With lazy-lists, this is easy to accomplish. Here’s an almost working implementation:

object SayPlease extends App {
  def report(input: String) = "The input is " + input.length + " characters long."
  val inputs = LazyList.continually( readLine("Enter some text: ") )
  inputs.map(report).foreach(println)
}

The only difference to the previous version is: we don’t return a vector of predefined strings but a lazy-list that “brings” inputs for the program to process. We have here an infinite list of strings that are generated by asking the user to provide them. This command alone does not yet prompt the user for any inputs! It only defines a collection whose elements come from calling readLine whenever we need a new element.

map returns a lazy-list of reports, whose each element is formed (as needed) by calling readLine and applying report to the resulting string. This command also doesn’t prompt the user for the inputs, nor does it call report on them. It simply prepares a lazy-list that does that if and when we later access the elements.

We use foreach to print out the elements of the report list. Before it can process an element, the lazy-list object is forced to determine what that element is by prompting the user for input and applying report.

In practice, what we get is a program that repeatedly receives keyboard input and reports its length.

That pretty much works. But we didn’t attend to the magic word yet: the above program just keeps prompting for new inputs till the cows come home. This one doesn’t:

object SayPlease extends App {
  def report(input: String) = "The input is " + input.length + " characters long."
  val inputs = LazyList.continually( readLine("Enter some text: ") )
  inputs.takeWhile( _ != "please" ).map(report).foreach(println)
}

takeWhile cuts off the lazy-list at "please". No more elements (user inputs) will be generated once that element is reached. (Cf. the earlier example of takeWhile on random numbers.)

There’s a piece missing from this REPL example:

import scala.util.Randomimport scala.util.Random
val sumOfSixDieRolls = LazyList.continually( Random.nextInt(6) ).???.sumsumOfSixDieRolls: Int = 19

What can you write in place of the question marks to make the code sum up six numbers, each between 1 and 6 (inclusive)?

import scala.io.StdIn._

object Program extends App {
  val linesEnteredByUser = LazyList.continually( readLine("Write something: ") )
  val result = linesEnteredByUser.dropWhile( _.length < 5 ).head
  println(result)
}

What does that program print out?

import scala.io.StdIn._

object ExampleApp1 extends App {
  val userInputs = LazyList.continually( readLine("Enter a number: ") )
  println(userInputs.take(4).mkString(","))
  println(userInputs.take(4).map( _.toInt ).product)
}

What does that program do? (Let’s assume the user types in integer numbers and doesn’t interrupt the program.)

import scala.io.StdIn._

object ExampleApp2 extends App {
  val firstInputs  = LazyList.continually( readLine("Enter a number: ") )
  val secondInputs = LazyList.continually( readLine("Enter a number: ") )
  println(firstInputs.take(4).mkString(","))
  println(secondInputs.take(4).map( _.toInt ).product)
}

This program is the same as the previous one except that there are two lazy-lists.

Which of the following correctly describes what this program does?

import scala.io.StdIn._

object ExampleApp3 extends App {
  def userInputs = LazyList.continually( readLine("Enter a number: ") )
  println(userInputs.take(4).mkString(","))
  println(userInputs.take(4).map( _.toInt ).product)
}

This program is identical to ExampleApp1 above except that userInputs is defined as a function. That function is called twice.

Which of the following correctly describes what this program does?

Study the following program. You may also want to copy it in a file and run it.

If you have difficulty understanding the code, revisit Chapter 6.4’s averageRainfallFromStrings assignment and its example solution.

import scala.io.StdIn._

object ExampleApp4 extends App {
  def inputs = LazyList.continually( readLine("Enter a number: ") )
  val result = inputs.flatMap( _.toIntOption ).head
  println(result)
}

Which of the following claims are correct?

Assignment: Sentiments about Movies

Let’s create a program that reads in the user’s short textual comments about movies. The program then tries to figure out whether each of those user inputs is positive or negative in tone. That estimate will be based on thousands of earlier movie reviews that have been written by real people, manually marked as positive or negative by a human, and made accessible to the program.

The more difficult and laborsome parts of this program have already been written. You’ll only need to put the user interface in order.

Task description

Your task is to fetch Sentiments and complete its user interface in MovieSentimentApp.scala. The UI should work in the text console as shown below.

Please comment on a movie or hit Enter to quit: This is a masterpiece, a truly fantastic work of cinema art.
I think this sentiment is positive. (Average word sentiment: 0.36.)

Please comment on a movie or hit Enter to quit: I hated it.
I think this sentiment is negative. (Average word sentiment: -0.41.)

Please comment on a movie or hit Enter to quit: The plot had holes in it.
I think this sentiment is negative. (Average word sentiment: -0.28.)

Please comment on a movie or hit Enter to quit: Adam Sandler
I think this sentiment is negative. (Average word sentiment: -0.80.)

Please comment on a movie or hit Enter to quit: It was great.
I think this sentiment is positive. (Average word sentiment: 0.04.)

Please comment on a movie or hit Enter to quit: It wasn't great.
I think this sentiment is negative. (Average word sentiment: -0.16.)

Please comment on a movie or hit Enter to quit: It was "great".
I think this sentiment is positive. (Average word sentiment: 0.04.)

Please comment on a movie, or hit Enter to quit: 
Bye.
The program doesn’t even try for a finespun analysis. It just lumps the user’s inputs in two categories: positive and negative.
The program also reports a number that it used in making each decision. A positive number means that, on average, the individual words of the input tend to occur in positive movie reviews, judging by the genuine data the program has been trained with.
If the input is blunt enough, the program classifies it fairly accurately. It doesn’t get punctuation, though, or many of the other subtleties of human communication.
The user stops the program with an empty input. (The input must be genuinely empty with no characters, not even spaces.)

The training data tells our program which words occur in positive reviews and which words occur in negative ones. You can find that data in sample_reviews_from_rotten_tomatoes.txt within the Sentiments module. There’s no need for you to touch that data, but do take a look; it will help you understand how the program works.

The analyzer’s programmatic implementation is in SentimentAnalyzer.scala. Go ahead and browse the Scaladocs and the code if you want (but it's not required).

Implement the user interface in MovieSentimentApp.scala in package o1.sentiment.ui. The parts of the puzzle are there already, but you’ll need to use a lazy-list and its methods to piece them together.

Instructions and hints

  • The UI should work much like our earlier example that reports the lengths of the inputs.
  • In fact, there is rather little that you need to do. You can solve the entire assignment with a couple of lines of code, or even just one.
  • The given analyzer is a simple example of automatic sentiment analysis. It may also be considered a very primitive example of machine learning, depending on how exactly you define that term. You’ll find some optional material on these topics below.

A+ presents the exercise submission form here.

Sentiment analysis

Our example program categorizes emotions with an algorithm that is so simple that it’s even a bit surprising how often it’s accurate. The program reads in the training data, counts how often each different word appears in positive and negative reviews, and assigns each of those words a “positivity score”. When the user types in something, the program finds all the known words in the input and computes the average of those words’ positivity scores.

That algorithm is implemented in class SentimentAnalyzer; take a look if you want. However, the code will be easier to understand once we’ve reached Week 9; we’ll remind you then.

People have come up with considerably more sophisticated and accurate algorithms for analyzing sentiments. In fact, this assignment is based on a programming competition on the Kaggle web site.

Automatic sentiment analysis has diverse applications: it can improve the product recommendations that web sites give to potential buyers, for instance, and help humans in various diagnosis tasks. Sentiment analysis may be based on texts, as in our example, but can also take other forms; for more information, see the relevant article on Wikipedia.

Machine learning

Machine learning (koneoppiminen) is a white-hot subfield of computer science. In recent years, machine learning has grown into a constant topic of conversation at universities, in the business world, and in society at large. We hear about machines that learn to drive cars, predict the weather and the stock market, filter out spam, play board games, recognize objects and faces in images, recognize emotions from the tiniest changes in skin color, translate between languages, diagnose illnesses and security threats, evaluate insurance claims, sort garbage, and do a seemingly endless number of other things.

Our sentiment-analysis program can be interpreted as a primitive example of machine learning, at least if we pick a suitably wide definition of this somewhat contested term.

At least, our program has several characteristics that are typical of machine learning:

  • The program receives training data, which it uses to “learn” about a particular phenomenon that the program specializes in. In this case, the training data consists of a file of annotated movie reviews, from which the program learns how positive and negative sentiments tend to be expressed. In other cases, the training data might consist of other kinds of text, images, numbers, etc.
  • The program uses the training data to determine parameters for its core algorithm: the data defines each word’s “positivity score”. The key point is that the program code itself doesn’t say anything about how positive or negative a word might be; it’s all in the data.
  • Using what it learned from the data, the program is capable of dealing with novel situations as long as they are sufficiently similar to the training data. More specifically, this program is capable of analyzing new movie reviews.

Some aspects of learning are conspicuously absent, however. We haven’t systematically assessed how accurate our program’s analyses are. Neither did we use such assessments to improve the algorithm, let alone having the program assess and improve itself automatically. Many more sophisticated applications of machine learning collect more data as they run and adjust their behavior accordingly. (A fun example is Google’s Quick, Draw! game.)

Machine learning features in many courses at Aalto. Programming 2 gives a brief introduction, and several later courses specifically deal with this topic, such as Machine Learning: Basic Principles. Moreover, there are several research groups at Aalto that apply machine learning and develop new machine-learning algorithms.

Was that artificial intelligence?

You decide.

Artificial intelligence or AI (tekoäly; keinoäly) is another fashionable but ambiguous term.

In the traditional sense, AI refers to the endeavor of artificially creating human-like intelligence: the attempt to model how humans think and to reproduce those processes computationally.

These days, the term means a variety of other things, too.

In casual conversation and marketing speech, AI might refer to just about any application that does something fancy enough to be called intelligent. Such an application perhaps makes people think that “that’s usually done by people, not machines” or “I didn’t think that machines can do that already”. In this everyday sense, our sentiment analyzer may count as AI in case its behavior is suitably impressive.

In the current news media, AI often refers to novel applications of machine learning, such as those listed above. These applications generally have little to do with human-like cognition; that isn’t their purpose at all. Quite the opposite, in fact: these programs construct mathematical models from input data and make use of the strengths that computers have over humans: speed, precision, and tirelessness.

When something like a computer winning at Chess or IBM’s Watson answering questions is marketed as AI, some pioneers of traditional AI get angry. (Claims about the annihilation of humankind by AI-gone-rogue can make them very angry.) Nevertheless, the word’s meaning appears to have permanently loosened from what it was. Perhaps AI will one day solve this terminological schism?

Aalto’s course CS-E4800 Artificial Intelligence introduces some forms of AI. The University of Helsinki offers a brief introductory course on AI as a MOOC.

Narrow, general, weak, and strong AI

Many experts are careful to draw a distiction between narrow and general AI.

Narrow AI refers to systems with “intelligence” that is limited to a particular specialization. Nearly all of the much-publicized applications of machine learning fall under narrow AI.

If we teach it to recognize cars [in images], all it does is recognize cars. There is no chance that it ends up doing something else.

—Timo Aila, researcher of machine learning (translated from an article in Finnish)

General AI is a still much loftier goal: intelligence that isn’t constrained to a specific domain but is capable of broad comprehension, learning about different domains and combining its knowledge of them. A general AI system could conceivably be programmed to develop narrow AI subsystems that deal with specific domains.

You’ll also hear the terms weak and strong AI, which mean roughly the same as narrow and general AI. This second set of terms sometimes carries the additional connotation that a strong AI system — unlike a weak one — is conscious on some level and might experience feelings.

Assignment: Rainfall

This assignment provides some more practice on lazy-lists and keyboard input.

Task description

In Chapter 6.4, you wrote averageRainfall. That function computes the average of a vector of rainfall measurements (Ints), stopping at 999999, which marks the end of the first data series. Any negative numbers in the input vector are ignored.

Now implement the same functionality as an interactive program. Place the program in module HigherOrder, in RainfallApp.scala

The program should work precisely as illustrated in the examples below.

Enter rainfall (or 999999 to stop): 10
Enter rainfall (or 999999 to stop): 5
Enter rainfall (or 999999 to stop): 100
Enter rainfall (or 999999 to stop): 10
Enter rainfall (or 999999 to stop): 5
Enter rainfall (or 999999 to stop): -100
Enter rainfall (or 999999 to stop): -100
Enter rainfall (or 999999 to stop): 110
Enter rainfall (or 999999 to stop): 999999
The average is 40.
Enter rainfall (or 999999 to stop): 999999
No valid data. Cannot compute average.
Enter rainfall (or 999999 to stop): -123
Enter rainfall (or 999999 to stop): -1
Enter rainfall (or 999999 to stop): 999999
No valid data. Cannot compute average.

Instructions and hints

  • In this assignment, you do not have to trouble yourself with what happens if the user enters invalid values such as the empty string "", "llama", or "3.141". You can assume that the entered text can be successfully interpreted as an integer by the toInt method (Chapter 5.2).
  • You could solve the assignment by reading the inputs with a lazy-list, copying them into a vector (toVector), and then computing the average as you did in the earlier rainfall assignment. On the other hand, you don’t need a vector at all, since a lazy-list also has the necessary methods.
    • For instance, it’s fine to call size and sum on a lazy-list, as long as it’s finite.
    • If you do, use a val that refers to the lazy-list. (Don’t accidentally create multiple lazy-lists that read input from the keyboard as in the multiple-choice question above.)
  • Make sure to spell the program’s output exactly right so that the automatic grader gives you the points you deserve.

A+ presents the exercise submission form here.

Optional Reading: LazyLists and Efficiency

It’s already come up several times that the ostensibly little choice between def and val makes quite a difference sometimes. This optional section elaborates on how that choice matters for LazyLists.

val, def, LazyLists, and memory usage

An introductory example

As long as you have a reference to a lazy-list stored somewhere, you don’t lose the list to the garbage collector. This was illustrated in the earlier examples where we had variables (vals) that referred to our lazy-lists`, as we have here:

val myRandomInts = LazyList.continually( Random.nextInt(100) )myRandomInts: LazyList[Int] = LazyList(<not computed>)
myRandomInts.take(5).mkString(",")res17: String = 11,70,28,1,93
myRandomInts.take(5).mkString(",")res18: String = 11,70,28,1,93

We got the same numbers each time, since we had just the one lazy-list. What if we hadn’t used a val but a function that returns a reference to a new lazy-list:

def myRandomInts = LazyList.continually( Random.nextInt(100) )myRandomInts: LazyList[Int]
myRandomInts.take(5).mkString(",")res19: String = 42,72,7,86,30
myRandomInts.take(5).mkString(",")res20: String = 2,2,64,87,54

The expression myRandomInts calls a parameterless function. Whenever the computer evaluates this expression, we get an entirely new lazy-list. The two lazy-lists are independent of each other and have distinct elements.

Now that we didn’t store a reference to either collection, the LazyList objects won’t stay in memory. The garbage collector promptly releases the memory for other use.

Efficiency concerns

Let’s compare another two code fragments. Here’s the first fragment, in which a val refers to a lazy-list:

val longListOfInputs = LazyList.continually( readSomeDataFromSomewhere() )
longListOfInputs.map(doStuffWithDataElement).foreach(println)
longListOfInputs.map(doSomethingElseWithElement).foreach(println)
We store the lazy-list reference in a variable. We intend to read a large but finite amount of data from a source (a file, the internet, the keyboard; somewhere).
When we subsequently command the lazy-list to process its elements and print out the results, the entire list is traversed and each of its elements generated with a function that reads a new element into memory. The variable longListOfInputs stores a reference to all the data.
Later, we take the previously stored inputs and process them in some other way. The latter traversal doesn’t fetch the inputs again: it uses the elements stored by the lazy-list.

Here’s the same code but with a def instead of a val:

def longListOfInputs = LazyList.continually( readSomeDataFromSomewhere() )
longListOfInputs.map(doStuffWithDataElement).foreach(println)
longListOfInputs.map(doSomethingElseWithElement).foreach(println)

This program doesn’t store the lazy-list and its contents anywhere. In fact, even while the computer processes the list one element at a time, each previously processed element becomes fair game for the garbage collector, which operates in the background. Especially if the list is long, it can happen that the first elements have been processed and removed from memory before the last elements have even been generated.

Now that the input data is not stored anywhere, the second traversal causes all the input to be re-read.

Which solution is better depends on circumstances:

  • If the lazy-list is short, if there is plenty of memory available — and especially if the goal is to traverse the same elements multiple times — then it’s a good idea to use a variable.
  • If the elements are traversed only once and each one can be discarded as soon as it’s dealt with (as in our SentimentAnalyzer, for instance), then def is the better choice. This approach also has the benefit that all the elements don’t have to fit in memory at once.

When operating on small amounts of data, this difference between the alternatives is often negligible.

The internals of LazyList

The internal implementation of lists is based on linking each element to the next element in the list.

The linked structure makes it possible to discard previously processed parts of the list from memory as described above. On the other hand, it also means that the only efficient way to process a list is to advance linearly from the head rather than picking out elements at arbitrary indices (which is fine on buffers and vectors, for example).

You’ll learn more about linked structures and efficiency in various follow-on courses, including Programming 2.

Number Sequences, Conveniently

Is it possible to create an infinite vector with all the natural numbers, or something like that?

A vector (or a buffer or any other strictly evaluated collection) won’t do, because all of a vector’s elements are stored in memory and an infinite vector would need an infinite amount of storage.

But a lazy-list will do:

val positiveNumbers = LazyList.from(1)positiveNumbers: LazyList[Int] = LazyList(<not computed>)
positiveNumbers.take(3).foreach(println)1
2
3

As it happens, the Scala API gives us from, a helpful method that creates a lazy-list of increasing numbers. We pass in the first number.

Here we increment by ten at each element:

val evenTens = LazyList.from(0, 10)evenTens: LazyList[Int] = LazyList(<not computed>)
evenTens.take(3).foreach(println)0
10
20

And here we increment by -1, which gives us a lazy-list of decreasing numbers:

val negativeNumbers = LazyList.from(-1, -1)negativeNumbers: LazyList[Int] = LazyList(<not computed>)
negativeNumbers.take(3).foreach(println)-1
-2
-3

Of course, from also works in combination with other methods:

val firstBigSquare = LazyList.from(0).map( n => n * n ).dropWhile( _ <= 1234567 ).headfirstBigSquare: Int = 1236544
val words = Vector("first", "second", "third")words: Vector[String] = Vector(first, second, third)
val myList = LazyList.from(0).map( n => words(n % words.size) )myList: LazyList[String] = LazyList(<not computed>)

myList refers to a lazy-list. Which of the following options correctly describes that lazy-list?

Optional Reading: Different Kinds of Lazy-Lists

A lazy-list that iterates

iterate is another method worth mentioning. It creates a collection that generates each element by re-applying a function to the previous element:

val alternating = LazyList.iterate(1)( x => -2 * x )alternating: LazyList[Int] = LazyList(<not computed>)
alternating.take(4).foreach(println)1
-2
4
-8
val babble = LazyList.iterate("")( "blah" + _ )babble: LazyList[String] = LazyList(<not computed>)
babble.take(4).foreach(println)
blah
blahblah
blahblahblah

The example below uses iterate to generate a lazy-list that implements an algorithm for estimating the square root of a given number, without resorting to the sqrt library function. (The program applies Newton’s method for approximating square roots of positive numbers.)

def squareRoot(n: Double) = {
  def isTooFar(approx: Double) = (approx * approx - n).abs > 0.0001
  def nextApprox(prev: Double) = (prev + n / prev) / 2
  def approximations = LazyList.iterate(1.0)(nextApprox)
  approximations.dropWhile(isTooFar).head
}squareRoot: (n: Double)Double
squareRoot(9)res21: Double = 3.000000001396984
squareRoot(654321)res22: Double = 808.9011064400888
The first of three auxiliary functions checks whether a given approximation isn’t close enough to the desired solution.
The second function uses an earlier approximation to generate another, as per Newton’s method.
The third generates an infinite list of approximate solutions. The list starts with a “guess” of 1.0 and generates better approximations by applying nextApprox repeatedly.
We get a good-enough approximation by dropping all those approximations at the front that aren’t close enough. Then we pick out the first element that remains.

A recursively defined lazy-list

Methods such as continually, from, and iterate work well for many needs. But what if you didn’t have these factory methods? Or what if you want to define a lazy-list that doesn’t match what those methods do?

You can use recursion to tailor exactly the sort of collection that we need.

Below is a very simple example of a recursive (i.e., self-referential) definition of a lazy-list. This definition produces a sequence of positive integers (just like LazyList.from(1)).

def positiveNumbers(first: Int): LazyList[Int] = first #:: positiveNumbers(first + 1)positiveNumbers: (first: Int)LazyList[Int]
positiveNumbers(1).take(3).foreach(println)1
2
3
The operator #:: combines a single value and a lazy-list, yielding another lazy-list. The value to the left of the operator becomes the first element; it’s followed by the elements of the list on the right-hand side.
The definition is recursive: it refers to itself. We form a sequence of positive integers by starting at the given integer and following it with a sequence of positive integers that starts at the next integer.

In the Scala API, methods such as from and continually have recursive implementations. The same basic principle works for defining any kind of LazyList. Try it!

Recursion is a versatile and powerful technique; lazy-lists are just one of its countless applications. Chapter 12.1 will say more.

A code-reading challenge

Here’s a more elaborate example. Can you figure out what this function does and how it does it?

The names of the function and its components are vague but not deliberately misleading.

This is a math-y sort of program.

def mystery(limit: Int): Vector[Int] = {
  import scala.math.sqrt

  val odd = LazyList.from(3, 2)
  def candidates = odd.takeWhile( _ <= limit )

  def initialVals = odd.takeWhile( _ <= sqrt(limit).toInt )
  def multiples(n: Int) = LazyList.from(n * n, 2 * n).takeWhile( _ <= limit )
  def rejected = initialVals.flatMap(multiples)

  val result = candidates.diff(rejected)
  result.toVector
}

Optional Reading: Lazy Variables and “Regular Lists”

Lazy variables

What’s the difference between these two programs?

val result = "llama".length
println(result)
println(result)
def result = "llama".length
println(result)
println(result)

The answer should be clear enough at this stage. The first program takes care of business (computes the string’s length) right away exactly once and stores the result in memory (in a val). The second, which defines a function, initially doesn’t compute the result at all and never would if you never called the function; on the other hand, it does the same job multiple times if you call the function repeatedly.

In other words, the second program computes the length twice. The first program uses some additional memory to avoid recomputing it.

We’ve established that lazy-lists take the middle approach between those two extremes: a lazy-list takes care of business (generates elements) only upon request but then stores the result and avoids recomputing it. We called this behavior “lazy”.

An individual variable can also be lazy. In Scala, you define such a variable by writing lazy val:

lazy val result = "llama".lengthresult: Int = <lazy>
println(result)5
println(result)5
Even though we assigned "llama".length" to the variable, that expression is not yet evaluated. For now, the lazy variable is linked to the unevaluated expression.
The REPL’s output of <lazy> means that this variable doesn’t have a value yet.
When we actually do something with the variable (here: we print its value), the computer needs to evaluate the assigned expression. It is only at this point that the string’s length is computed and an integer gets stored in the variable.
The second print command looks the same as the first. However, the computer does not recompute the string length at this point. It doesn’t have to bother because the length is already stored in the variable.

Here’s another example. Let’s define a function and a couple of lazy variables.

def printAndReturn(number: Int) = {
  println("Returning my parameter value of " + number)
  number
}printAndReturn: (number: Int)Int
lazy val first = printAndReturn(1)first: Int = <lazy>
lazy val second = printAndReturn(2)second: Int = <lazy>

Neither of the two variable definitions above caused printAndReturn to be invoked yet, as you can see from the fact that the println within the function didn’t trigger yet. Let’s continue:

if (first > 0) first * 10 else second * 10Returning my parameter value of 1
res23: Int = 10
if (first > 0) first * 10 else second * 10res24: Int = 10
In order to evaluate the conditional expression, the computer needs to evaluate first, so it determines this lazy variable’s value by calling printAndReturn. We get a line of output from the function.
The first branch is chosen, so the computer evaluates first * 10 next. first already has a value, which doesn’t get recomputed. Notice that we don’t get an additional line of output, as we would have if first had been a def rather than lazy val.
Even issuing the entire command again doesn’t trigger printAndReturn again, since first already has a value.
Since the else branch was never chosen, the value of second is unimportant. Indeed, the computer never even computed a value for that variable.

In some programming languages, most famously in Haskell, all variables are lazy and all expressions are evaluated non-strictly. In most programming languages, variables aren’t lazy and expression evaluation is largely strict.

In Scala, too, strict evaluation is the default, but as you have seen in this chapter, you can opt to use non-strict parameters and lazy collections and variables where you think it’s called for.

A short introduction to “regular lists”

In this chapter, we’ve discussed lazy-lists and the corresponding Scala class. There are also lists that aren’t lazy.

Lists (lista) are immutable collections and, in that, resemble vectors. They are especially common in functional programming, and many Scala programmers use them a lot. (Lists will also play a significant part in Programming 2.)

Lists work much like other collections. Here’s one way to create a new list:

val emptyList = List[Int]()emptyList: List[Int] = List()
val wordList = List("first", "second", "third")wordList: List[String] = List(first, second, third)

The familiar methods are available:

wordList.sizeres25: Int = 3
wordList.tailres26: List[String] = List(second, third)
wordList.map( _.length )res27: List[Int] = List(3, 4, 6)
val longer = "first-er" :: wordListwordList: List[String] = List(first-er, first, second, third)
The :: operator is specific to lists. It forms a new list that has one additional element; it is thus equivalent to the generic collection operator +: (Chapter 4.2) and the lazy-list operator #::.

Because of how they are implemented, lists are efficient if you need to manipulate only the head of the collection or if you need to traverse the elements only in order from first to last. On the other hand, a list is probably a poor choice if you intend to access arbitrary elements anywhere in the collection (by their index, for instance).

Summary of Key Points

  • A lazy-list is a collection whose elements aren’t formed as soon as the collection is created. The elements are generated and stored as needed.
    • A lazy-list is intended for traversing the elements in linear order.
    • You can process a lazy-list’s elements one at a time without storing them all simultaneously in the computer’s memory or even knowing how many elements there are or will be.
    • Unlike the previously familiar collections, a lazy-list can be either finite or infinite. In Scala, you can use methods such as take and takeWhile to select a finite part of an infinite LazyList.
    • There are many ways to form a LazyList. For instance, you can repeat an operation that creates elements (continually), produce elements in numerical order (from), or repeatedly apply an operation to the previous element (iterate).
  • By-name parameters are received by the function as unevaluated expressions. This contrasts with ordinary (by-value) parameters, which receive the previously computed values of expressions.
    • The receiving method may evaluate a by-name parameter once, more than once, or not at all, as appropriate.
    • These parameters are useful for some purposes, such as forming lazy-lists.
  • Links to the glossary: lazy-list (or stream); by-name parameter; strict, non-strict, and lazy evaluation. Also mentioned: machine learning, artificial intelligence.

Feedback

Please note that this section must be completed individually. Even if you worked on this chapter with a pair, each of you should submit the form separately.

Credits

Thousands of students have given feedback that has contributed to this ebook’s design. Thank you!

The ebook’s chapters, programming assignments, and weekly bulletins have been written in Finnish and translated into English by Juha Sorva.

The appendices (glossary, Scala reference, FAQ, etc.) are by Juha Sorva unless otherwise specified on the page.

The automatic assessment of the assignments has been developed by: (in alphabetical order) Riku Autio, Nikolas Drosdek, Joonatan Honkamaa, Jaakko Kantojärvi, Niklas Kröger, Teemu Lehtinen, Strasdosky Otewa, Timi Seppälä, Teemu Sirkiä, and Aleksi Vartiainen.

The illustrations at the top of each chapter, and the similar drawings elsewhere in the ebook, are the work of Christina Lassheikki.

The animations that detail the execution Scala programs have been designed by Juha Sorva and Teemu Sirkiä. Teemu Sirkiä and Riku Autio did the technical implementation, relying on Teemu’s Jsvee and Kelmu toolkits.

The other diagrams and interactive presentations in the ebook are by Juha Sorva.

The O1Library software has been developed by Aleksi Lukkarinen and Juha Sorva. Several of its key components are built upon Aleksi’s SMCL library.

The pedagogy of using O1Library for simple graphical programming (such as Pic) is inspired by the textbooks How to Design Programs by Flatt, Felleisen, Findler, and Krishnamurthi and Picturing Programs by Stephen Bloch.

The course platform A+ was originally created at Aalto’s LeTech research group as a student project. The open-source project is now shepherded by the Computer Science department’s edu-tech team and hosted by the department’s IT services. Markku Riekkinen is the current lead developer; dozens of Aalto students and others have also contributed.

The A+ Courses plugin, which supports A+ and O1 in IntelliJ IDEA, is another open-source project. It was created by Nikolai Denissov, Olli Kiljunen, and Nikolas Drosdek with input from Juha Sorva, Otto Seppälä, Arto Hellas, and others.

For O1’s current teaching staff, please see Chapter 1.1.

Additional credits for this page

The rainfall assignment is a version of a classic programming problem by Elliot Soloway.

The sentiment-analyzer assignment is an adaptation of a programming assignment by Eric D. Manley and Timothy M. Urness, which is in turn based on a programming contest on the web site Kaggle.

The song lyrics are from a Bruno Mars song.

a drop of ink
Posting submission...