Stream Summary Statistics

by Horatiu Dan

Context

To make good use of the capabilities of Java Streams, one should first understand two general concepts – the stream and the stream pipeline. A stream in Java is a sequential flow of data. A stream pipeline, on the other hand, represents a series of steps applied to that data, steps that ultimately produce a result.

My family and I recently visited the Legoland Resort in Germany – a great place by the way – and there, among other attractions, we had the chance to observe in detail a sample of the brick building process. Briefly, everything starts from the granular plastic that is melted, modeled accordingly, assembled, painted, stenciled if needed and packed up in bags and boxes. All the steps are part of an assembly factory pipeline.

What is worth mentioning is the fact that the next step cannot be done until the previous one has completed and also that the number of steps is finite. Moreover, at every step, each Lego element is touched to perform the corresponding operation and then it moves only forward, never backwards, so that the next step is done. The same applies to Java streams.

In Java, these steps are called stream operations and they fall into three categories – one that starts the job (the source), one that ends it and produces the result (the terminal operation) and zero or more intermediate ones in between.

As a last consideration, it is worth mentioning that intermediate operations transform the stream into another one, but they are never run until the terminal operation runs (they are lazily evaluated). Finally, once the result is produced, the stream has been consumed and is no longer valid.
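
The laziness of intermediate operations can be observed with a small sketch. The class below is hypothetical (LazyStreamDemo and its run() method are not part of the article's project) and assumes a sequential stream; it records the moment each step executes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical demo class, used only to illustrate lazy evaluation.
class LazyStreamDemo {

    // Returns a log of events, so that the order of execution is observable.
    static List<String> run() {
        List<String> log = new ArrayList<>();

        // Source plus one intermediate operation: nothing is executed yet.
        Stream<Integer> doubled = Stream.of(1, 2, 3)
                .map(i -> {
                    log.add("map " + i);
                    return i * 2;
                });
        log.add("pipeline built");

        // The terminal operation triggers the actual traversal.
        int sum = doubled.mapToInt(Integer::intValue).sum();
        log.add("sum " + sum);

        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Running it logs "pipeline built" first, then the three map entries and finally "sum 12", which suggests that the mapping function is invoked only once the terminal operation starts.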

Abstract

Starting from the fact that a Java stream is no longer valid once its terminal operation is done, this article aims to present a way of computing multiple statistics at once through a single stream traversal. This is accomplished by leveraging the Java summary statistics classes (in particular IntSummaryStatistics), which have been available since Java 8 (version 1.8).

Proof of Concept

The small project built especially to showcase the statistics computation uses the following:

  • Java 17
  • Maven 3.6.3
  • JUnit Jupiter Engine v.5.9.3

As domain, there is one straightforward entity – a parent.

public record Parent(String name, int age) { }

It is modeled by two attributes – the name and the age. While the name is present only to distinguish the parents, the age is the attribute of interest here.

The purpose is to be able to compute a few age statistics on a set of parents, that is:

  • the total sample count
  • the ages of the youngest and the oldest parent
  • the age range of the group
  • the average age
  • the total number of years the parents accumulate.

The results are encapsulated into a ParentStats structure, represented as a record as well.

public record ParentStats(long count,
                          int youngest,
                          int oldest,
                          int ageRange,
                          double averageAge,
                          long totalYearsOfAge) { }

In order to accomplish this, an interface is defined.

public interface Service {

    ParentStats getStats(List<Parent> parents);
}

For now, it has only one method, which receives a list of Parents as input and provides the desired statistics as output.

Initial Implementation

As the problem is trivial, an initial and imperative implementation of the service might be as below:

import java.util.List;

public class InitialService implements Service {

    @Override
    public ParentStats getStats(List<Parent> parents) {
        int count = parents.size();
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        int sum = 0;
        // Single pass: track the youngest age, the oldest age and the running total.
        for (Parent parent : parents) {
            int age = parent.age();
            if (age < min) {
                min = age;
            }
            if (age > max) {
                max = age;
            }
            sum += age;
        }

        return new ParentStats(count, min, max, max - min, (double) sum / count, sum);
    }
}

The code is straightforward, but it focuses on the how rather than on the what, so the problem gets lost in the implementation details.

As the functional style and streams are already part of every Java developer’s practices, the following service implementation would most probably be chosen next.

public class StreamService implements Service {

    @Override
    public ParentStats getStats(List<Parent> parents) {
        int count = parents.size();

        int min = parents.stream()
                .mapToInt(Parent::age)
                .min()
                .orElseThrow(RuntimeException::new);

        int max = parents.stream()
                .mapToInt(Parent::age)
                .max()
                .orElseThrow(RuntimeException::new);

        int sum = parents.stream()
                .mapToInt(Parent::age)
                .sum();

        return new ParentStats(count, min, max, max - min, (double) sum/count, sum);
    }
}

The code is more readable now. The downside, though, is the redundant stream traversal needed to compute all the desired stats – three times in this particular case. As stated at the beginning of the article, once a terminal operation is done – min, max, sum – the stream is no longer valid, so a new stream has to be created for each statistic. It would be convenient to compute the desired statistics without having to iterate over the list of parents multiple times.
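
To illustrate why a fresh stream is needed for each statistic, here is a minimal sketch that attempts to run two terminal operations on the same stream instance. The class and method names (StreamReuseDemo, secondTerminalOperationFails) are hypothetical, and plain Integer ages are used instead of the Parent record to keep the example self-contained.

```java
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical demo class, used only to illustrate that a stream is single-use.
class StreamReuseDemo {

    // Returns true if the second terminal operation on the same stream fails.
    static boolean secondTerminalOperationFails(List<Integer> ages) {
        IntStream stream = ages.stream().mapToInt(Integer::intValue);
        stream.min(); // the first terminal operation consumes the stream
        try {
            stream.max(); // a second terminal operation on the same instance
            return false;
        } catch (IllegalStateException e) {
            // The stream has already been operated upon.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(secondTerminalOperationFails(List.of(31, 32, 33)));
    }
}
```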

Summary Statistics Implementation

Java provides a family of summary statistics classes in the java.util package, one for each primitive stream type – IntSummaryStatistics, LongSummaryStatistics, DoubleSummaryStatistics.

According to the JavaDoc, IntSummaryStatistics is “a state object for collecting statistics such as count, min, max, sum, and average. This class is designed to work with (though does not require) streams”. [Resource 1]
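
As the quote mentions, streams are not required. A minimal sketch of feeding values into the state object by hand, through its accept() method, might look as follows; the ManualStatsDemo class and its collect() helper are hypothetical names introduced only for this illustration.

```java
import java.util.IntSummaryStatistics;

// Hypothetical demo class: IntSummaryStatistics used without any stream.
class ManualStatsDemo {

    // Feeds each age into the state object, one value at a time.
    static IntSummaryStatistics collect(int... ages) {
        IntSummaryStatistics stats = new IntSummaryStatistics();
        for (int age : ages) {
            stats.accept(age);
        }
        return stats;
    }

    public static void main(String[] args) {
        IntSummaryStatistics stats = collect(31, 32, 33);
        System.out.println(stats.getCount());   // 3
        System.out.println(stats.getMin());     // 31
        System.out.println(stats.getMax());     // 33
        System.out.println(stats.getSum());     // 96
        System.out.println(stats.getAverage()); // 32.0
    }
}
```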

IntSummaryStatistics is a good candidate for the purpose stated initially, thus the following implementation of the Service seems the preferred one.

public class StatsService implements Service {

    @Override
    public ParentStats getStats(List<Parent> parents) {
        IntSummaryStatistics stats = parents.stream()
                .mapToInt(Parent::age)
                .summaryStatistics();

        return new ParentStats(stats.getCount(),
                stats.getMin(),
                stats.getMax(),
                stats.getMax() - stats.getMin(),
                stats.getAverage(),
                stats.getSum());
    }
}

There is only one stream of parents, the statistics are computed in a single traversal and the code is far more readable this time.
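
One aspect worth keeping in mind is the behavior for an empty input. The sketch below (EmptyStatsDemo is a hypothetical name) shows the values an IntSummaryStatistics reports when no element was recorded.

```java
import java.util.IntSummaryStatistics;
import java.util.stream.IntStream;

// Hypothetical demo class: the state of the statistics when nothing was recorded.
class EmptyStatsDemo {

    static IntSummaryStatistics emptyStats() {
        return IntStream.empty().summaryStatistics();
    }

    public static void main(String[] args) {
        IntSummaryStatistics stats = emptyStats();
        System.out.println(stats.getCount());   // 0
        System.out.println(stats.getSum());     // 0
        System.out.println(stats.getAverage()); // 0.0
        System.out.println(stats.getMin() == Integer.MAX_VALUE); // true
        System.out.println(stats.getMax() == Integer.MIN_VALUE); // true
    }
}
```

Consequently, for an empty list of parents the youngest and oldest values would be these sentinel values and their difference would overflow, so a caller that may receive empty lists would probably want to check getCount() first.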

In order to check all three implementations, the following abstract base unit test is used.

abstract class ServiceTest {

    private Service service;

    private List<Parent> mothers;
    private List<Parent> fathers;
    private List<Parent> parents;

    protected abstract Service setupService();

    @BeforeEach
    void setup() {
        service = setupService();

        mothers = IntStream.rangeClosed(1, 3)
                .mapToObj(i -> new Parent("Mother" + i, i + 30))
                .collect(Collectors.toList());

        fathers = IntStream.rangeClosed(4, 6)
                .mapToObj(i -> new Parent("Father" + i, i + 30))
                .collect(Collectors.toList());

        parents = new ArrayList<>(mothers);
        parents.addAll(fathers);
    }

    private void assertParentStats(ParentStats stats) {
        Assertions.assertNotNull(stats);
        Assertions.assertEquals(6, stats.count());
        Assertions.assertEquals(31, stats.youngest());
        Assertions.assertEquals(36, stats.oldest());
        Assertions.assertEquals(5, stats.ageRange());

        final int sum = 31 + 32 + 33 + 34 + 35 + 36;

        Assertions.assertEquals((double) sum/6, stats.averageAge());
        Assertions.assertEquals(sum, stats.totalYearsOfAge());
    }

    @Test
    void getStats() {
        final ParentStats stats = service.getStats(parents);
        assertParentStats(stats);
    }
}

As the stats are computed for all the parents, the mothers and fathers are first put together in the same parents list (we will see later why there were two lists in the first place).

The particular unit test for each implementation is trivial – it only sets up the service instance.

class StatsServiceTest extends ServiceTest {

    @Override
    protected Service setupService() {
        return new StatsService();
    }
}

Combining Statistics

In addition to the methods already used – getMin(), getMax(), getCount(), getSum(), getAverage() – IntSummaryStatistics provides a way to combine the state of another similar object into the current one.

void combine(IntSummaryStatistics other)

As we saw in the above unit test, initially there are two source lists – mothers and fathers. It would be convenient to be able to directly compute the statistics, without first merging them.

In order to accomplish this, the Service is enriched with the following method.

default ParentStats getCombinedStats(List<Parent> mothers, List<Parent> fathers) {
	final List<Parent> parents = new ArrayList<>(mothers);
	parents.addAll(fathers);
	return getStats(parents);
}

The first two implementations – InitialService and StreamService – are not of interest here, thus a default implementation was provided for convenience. It is overridden only by the StatsService.

@Override
public ParentStats getCombinedStats(List<Parent> mothers, List<Parent> fathers) {
	Collector<Parent, ?, IntSummaryStatistics> collector = Collectors.summarizingInt(Parent::age);

	IntSummaryStatistics stats = mothers.stream().collect(collector);
	stats.combine(fathers.stream().collect(collector));

	return new ParentStats(stats.getCount(),
			stats.getMin(),
			stats.getMax(),
			stats.getMax() - stats.getMin(),
			stats.getAverage(),
			stats.getSum());
}

By leveraging the combine() method, the statistics of separate source lists can be merged directly, without first concatenating the lists.

The corresponding unit test for getCombinedStats() is straightforward.

@Test
void getCombinedStats() {
	final ParentStats stats = service.getCombinedStats(mothers, fathers);
	assertParentStats(stats);
}
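
The same idea extends to any number of source lists. The following is a minimal sketch under a few assumptions: the names CombineDemo and combineAll are hypothetical, and plain Integer ages replace the Parent record so that the example is self-contained.

```java
import java.util.IntSummaryStatistics;
import java.util.List;

// Hypothetical demo class: merging the statistics of several groups.
class CombineDemo {

    // Computes the statistics of each group separately, then folds them into one.
    @SafeVarargs
    static IntSummaryStatistics combineAll(List<Integer>... groups) {
        IntSummaryStatistics total = new IntSummaryStatistics();
        for (List<Integer> group : groups) {
            IntSummaryStatistics partial = group.stream()
                    .mapToInt(Integer::intValue)
                    .summaryStatistics();
            total.combine(partial);
        }
        return total;
    }

    public static void main(String[] args) {
        IntSummaryStatistics stats = combineAll(List.of(31, 32, 33), List.of(34, 35, 36));
        System.out.println(stats.getCount());   // 6
        System.out.println(stats.getMin());     // 31
        System.out.println(stats.getMax());     // 36
        System.out.println(stats.getSum());     // 201
        System.out.println(stats.getAverage()); // 33.5
    }
}
```

Each list is still traversed exactly once and no intermediate merged list is allocated.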

Having seen the above Collector, the initial getStats() method may be written even more briefly.

@Override
public ParentStats getStats(List<Parent> parents) {
	IntSummaryStatistics stats = parents.stream()
			.collect(Collectors.summarizingInt(Parent::age));

	return new ParentStats(stats.getCount(),
			stats.getMin(),
			stats.getMax(),
			stats.getMax() - stats.getMin(),
			stats.getAverage(),
			stats.getSum());
}

Conclusion

Depending on the data type used, IntSummaryStatistics, LongSummaryStatistics or DoubleSummaryStatistics are convenient out-of-the-box structures that one can use to quickly compute simple statistics, while keeping the code readable and maintainable.

Resources

  1. IntSummaryStatistics JavaDoc
  2. Source code for the sample POC
  3. The picture was taken at Legoland Resort, Germany
