
Netty Websocket SSL

This is a small guide on how to create a Netty WebSocket client/server application communicating over SSL (wss). This guide showcases how to use JKS keystores/truststores, as they are the most common way of storing private keys and certificates in the Java world.

This guide will show:

  • How to create a private key along with a self-signed certificate using the Java keytool
  • How to create a truststore containing the self-signed certificate. This certificate will be used by the WebSocket client to 'trust' the WebSocket server during the SSL handshake
  • A simple Netty WebSocket server example, exposing an SSL connection using the private key generated in the step above
  • A simple Netty WebSocket client example, establishing an SSL connection to the server using the JKS truststore created in the step above

Creating Our Keystore/Truststore

Java keytool is a nice, easy-to-use utility, shipped with the JDK, for performing various cryptographic tasks (e.g. generating keys, generating and manipulating certificates). The official documentation is pretty easy to follow.

For our example, we need to generate a public/private key pair along with a self-signed certificate. This can be done with the command below, the output of which is a JKS store containing our private key and the self-signed certificate.

keytool -genkeypair -alias TestKey -keyalg RSA -keysize 2048 -keystore TestKeystore.jks -storetype JKS

The above JKS keystore will be used by our Netty websocket server to perform the SSL handshake.

Once we have the keystore, we can extract the self-signed certificate and import it into a JKS truststore. This truststore will be used by our WebSocket client to determine which certificates to trust. If the client does not trust the certificate presented by the server, the SSL handshake will fail.

The command to extract the certificate into a .cert file is:

keytool -exportcert -rfc -alias TestKey -keystore TestKeystore.jks -storepass changeit -storetype JKS -file TestCert.cert

And the command to import that exported certificate into a JKS truststore is:

keytool -importcert -file TestCert.cert -keystore TestTruststore.jks -storepass changeit -storetype JKS

Example Netty Application

Now that we have both the keystore (to be used by the Server) and the truststore (to be used by the client) we can create our demo Netty client/server applications.

Effectively, all we need to do is add an SslHandler to the ChannelPipeline. This SslHandler needs to reference the SslContext created from the respective JKS keystore/truststore.

Server

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelPipeline;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.http.HttpObjectAggregator;
import io.netty.handler.codec.http.HttpServerCodec;
import io.netty.handler.codec.http.websocketx.TextWebSocketFrame;
import io.netty.handler.codec.http.websocketx.WebSocketServerProtocolHandler;
import io.netty.handler.logging.LogLevel;
import io.netty.handler.logging.LoggingHandler;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;

public class NettyWSServer {

    public void start() {

        final NioEventLoopGroup bossGroup = new NioEventLoopGroup(1);
        final NioEventLoopGroup worker = new NioEventLoopGroup(1);
        final ServerBootstrap wsServer = new ServerBootstrap()
            .group(bossGroup, worker)
            .channel(NioServerSocketChannel.class)
            .handler(new LoggingHandler(LogLevel.INFO))
            .childHandler(new ChannelInitializer<Channel>() {
                @Override
                protected void initChannel(final Channel channel) throws Exception {
                    ChannelPipeline pipeline = channel.pipeline();

                    pipeline.addLast(createSSLContext().newHandler(channel.alloc()));

                    pipeline.addLast(new HttpServerCodec());
                    pipeline.addLast(new HttpObjectAggregator(64_000));
                    pipeline.addLast(new WebSocketServerProtocolHandler("/"));

                    pipeline.addLast(new SimpleChannelInboundHandler<TextWebSocketFrame>() {

                        @Override
                        protected void channelRead0(ChannelHandlerContext ctx, TextWebSocketFrame msg) throws Exception {
                            System.out.println("Message=" + msg.text());
                            ctx.writeAndFlush(new TextWebSocketFrame(msg.text() + " back"));
                        }
                    });
                }
            });

        final Channel serverChannel = wsServer.bind(10_000).syncUninterruptibly().channel();
        System.out.println("WS Server started");
        serverChannel.closeFuture().syncUninterruptibly();
    }

    private SslContext createSSLContext() throws Exception {
        // Load the JKS keystore containing our private key and self-signed certificate
        KeyStore keystore = KeyStore.getInstance("JKS");
        keystore.load(NettyWSServer.class.getResourceAsStream("/TestKeystore.jks"), "changeit".toCharArray());

        KeyManagerFactory keyManagerFactory = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        keyManagerFactory.init(keystore, "changeit".toCharArray());

        return SslContextBuilder.forServer(keyManagerFactory).build();
    }

    public static void main(String[] args) {
        new NettyWSServer().start();
    }
}

Client

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelPipeline;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;
import io.netty.handler.codec.http.DefaultHttpHeaders;
import io.netty.handler.codec.http.HttpClientCodec;
import io.netty.handler.codec.http.HttpObjectAggregator;
import io.netty.handler.codec.http.websocketx.TextWebSocketFrame;
import io.netty.handler.codec.http.websocketx.WebSocketClientHandshaker13;
import io.netty.handler.codec.http.websocketx.WebSocketClientProtocolHandler;
import io.netty.handler.codec.http.websocketx.WebSocketVersion;
import io.netty.handler.ssl.SslContextBuilder;
import java.net.URI;
import java.security.KeyStore;
import javax.net.ssl.TrustManagerFactory;

public class NettyWSClient {

    public void start() {

        final EventLoopGroup bossLoop = new NioEventLoopGroup(1);
        Bootstrap client = new Bootstrap()
            .group(bossLoop)
            .channel(NioSocketChannel.class)
            .handler(new ChannelInitializer<NioSocketChannel>() {
                @Override
                protected void initChannel(NioSocketChannel channel) throws Exception {
                    ChannelPipeline pipeline = channel.pipeline();

                    KeyStore truststore = KeyStore.getInstance("JKS");
                    truststore.load(NettyWSClient.class.getResourceAsStream("/TestTruststore.jks"), "changeit".toCharArray());
                    TrustManagerFactory trustManagerFactory = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
                    trustManagerFactory.init(truststore);

                    pipeline.addLast(SslContextBuilder.forClient().trustManager(trustManagerFactory).build().newHandler(channel.alloc()));

                    pipeline.addLast(new HttpClientCodec(512, 512, 512));
                    pipeline.addLast(new HttpObjectAggregator(16_384));
                    final String url = "wss://localhost:10000";
                    final WebSocketClientHandshaker13 wsHandshaker = new WebSocketClientHandshaker13(new URI(url),
                        WebSocketVersion.V13, null, false, new DefaultHttpHeaders(false), 64_000);
                    pipeline.addLast(new WebSocketClientProtocolHandler(wsHandshaker));

                    pipeline.addLast(new SimpleChannelInboundHandler<TextWebSocketFrame>() {

                        @Override
                        public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
                            if (evt instanceof WebSocketClientProtocolHandler.ClientHandshakeStateEvent) {
                                WebSocketClientProtocolHandler.ClientHandshakeStateEvent handshakeStateEvent = (WebSocketClientProtocolHandler.ClientHandshakeStateEvent) evt;
                                switch (handshakeStateEvent) {
                                    case HANDSHAKE_COMPLETE:
                                        System.out.println("Handshake completed. Sending Hello World");
                                        ctx.writeAndFlush(new TextWebSocketFrame("Hello World"));
                                        break;
                                }
                            }
                        }

                        @Override
                        protected void channelRead0(final ChannelHandlerContext ctx, TextWebSocketFrame msg) throws Exception {
                            System.out.println("Message=" + msg.text());
                        }
                    });
                }
            });
        client.connect("localhost", 10_000).channel().closeFuture().syncUninterruptibly();
    }

    public static void main(String[] args) {
        new NettyWSClient().start();
    }
}

Building A Logo Turtle App With Antlr And JavaFX

This post is about building a very simple application implementing a subset of the Logo programming language, along with a UI that visualizes the Logo programs entered by users.
The technologies used were ANTLR, for creating a parser for a subset of the Logo rules, and JavaFX, for building a UI that allows users to enter Logo programs and provides a drawing visualization of those programs.

The Logo Language

Logo is an educational language, mainly targeted at younger audiences. It is a language I personally had some interaction with back in junior high school. It effectively provides a grammar of movement rules (e.g. forward, back, right 90) along with some control flow (e.g. repeat), allowing the user to produce a set of commands. Those commands, coupled with visualization software, can draw vector graphics, or, coupled with robotic devices, can move a robot around.
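For instance, a classic Logo square looks like the snippet below (repeat itself is part of full Logo; the subset implemented later in this post omits it):

```logo
repeat 4 [ forward 50 right 90 ]
```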

It used to be a nice language for teaching kids programming; nowadays, however, there are better, more advanced options like Scratch. I picked Logo for its simple grammar, as my main goal was to refresh some ANTLR knowledge rather than build a real application.

Antlr

ANTLR is a great tool for parsing structured text (think regular expressions on steroids). It can parse grammatical rules, and applications can be built on top of its features; a common approach is to build custom DSLs using ANTLR. I will not describe ANTLR at length, as the official website has lots of good documentation and I am no ANTLR expert. Additionally, the book The Definitive ANTLR 4 Reference, written by its creator, is a good resource.

In this application, I have effectively defined my own set of Logo rules using ANTLR's grammar and relied on ANTLR's parsing capabilities to evaluate the Logo programs and give me callbacks for the Logo commands encountered.

JavaFX

Most people will be familiar with JavaFX. It is effectively the successor to Java Swing for building (modern?) UIs in Java. My UI skills are pretty bad, hence I wanted something to force me to build a UI. I picked JavaFX, instead of something more standard like HTML5 plus a JS framework, because I had done some Java Swing in the past and mainly wanted to try JavaFX out of curiosity.

Even though JavaFX is very feature-rich and its programming model resembles part Java Swing and part C# WPF (which I was a bit familiar with back in 2011), I was not impressed by it. It felt cumbersome, in ways where the whole programming model seemed to get in my way; maybe because I am not familiar with it, or maybe because it is just not a great programming model, which would explain its lack of widespread adoption.

Defining The Antlr Grammar

As mentioned above, ANTLR needs a grammar definition, which consists of parser and lexer rules. The lexer rules are used to extract tokens out of the text, and the parser rules extract meaningful statements.

I decided to only go with a small subset of the Logo features, so the below would be supported:

  • Moving forward
  • Moving backwards
  • Turning left/right
  • Allowing for pen up/down functionality, meaning that if the pen is up, no drawing should appear even if the 'turtle' moves around

Those rules, translated into an ANTLR grammar, can be found in Logo.g4.
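As a rough sketch of what such rules look like (the rule and token names here are illustrative, not necessarily the author's exact grammar):

```antlr
grammar Logo;

// parser rules: one per statement type, matching the listener callbacks
prog      : statement+ EOF ;
statement : forward | back | right | left | penUp | penDown ;
forward   : FORWARD NUMBER ;
back      : BACK NUMBER ;
right     : RIGHT NUMBER ;
left      : LEFT NUMBER ;
penUp     : PENUP ;
penDown   : PENDOWN ;

// lexer rules (tokens): the Logo keywords and number literals
FORWARD : 'forward' ;
BACK    : 'back' ;
RIGHT   : 'right' ;
LEFT    : 'left' ;
PENUP   : 'penup' ;
PENDOWN : 'pendown' ;
NUMBER  : [0-9]+ ;
WS      : [ \t\r\n]+ -> skip ;
```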

One can notice that the grammar just defines the keywords (i.e. forward, back, right etc.) as lexer rules (a.k.a. tokens) and program expressions (i.e. forward 50) as parser rules. At the application layer, ANTLR generates listener stubs for the parser rules; these can be implemented so that the user gets callbacks on those rules, and users can then write their logic on top of that.

It is easy to see how helpful ANTLR is, doing all the heavy lifting for the user. One just needs to extend the already generated listener, which propagates the events to the user's code.

Wiring Parser Callbacks

As we are mainly interested in the grammar rules that define Logo actions, we can implement only those callbacks. The class that deals with the callbacks can be made UI-agnostic and act as a driver to the underlying implementation. For example, we could have various implementations of how to visualize a Logo program:

  • A JavaFX UI
  • A Swing UI
  • A plain standard out program

The implementation below deals with that:

public class LogoDriver extends LogoBaseListener {

    private final TurtlePainter painter;

    public LogoDriver(TurtlePainter painter) {
        this.painter = painter;
    }

    @Override
    public void exitForward(final ForwardContext ctx) {
        this.painter.forward(Integer.parseInt(ctx.getChild(1).getText()));
    }

    @Override
    public void exitBack(final BackContext ctx) {
        this.painter.back(Integer.parseInt(ctx.getChild(1).getText()));
    }

    @Override
    public void exitRight(final RightContext ctx) {
        this.painter.right(Integer.parseInt(ctx.getChild(1).getText()));
    }

    @Override
    public void exitLeft(final LeftContext ctx) {
        this.painter.left(Integer.parseInt(ctx.getChild(1).getText()));
    }

    @Override
    public void exitSet(final SetContext ctx) {
        final String[] point = ctx.POINT().getText().split(",");
        final int x = Integer.parseInt(point[0]);
        final int y = Integer.parseInt(point[1]);
        this.painter.set(x, y);
    }

    @Override
    public void exitPenUp(final PenUpContext ctx) {
        this.painter.penUp();
    }

    @Override
    public void exitPenDown(final PenDownContext ctx) {
        this.painter.penDown();
    }

    @Override
    public void exitClearscreen(ClearscreenContext ctx) {
        this.painter.cls();
    }

    @Override
    public void exitResetAngle(ResetAngleContext ctx) {
        this.painter.resetAngle();
    }

    @Override
    public void exitProg(ProgContext ctx) {
        this.painter.finish();
    }
}

The TurtlePainter can be anything, even a program that records the commands and asserts on them, like a JUnit spy.
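To make the 'JUnit spy' idea concrete, here is a minimal sketch. The TurtlePainter interface itself is not shown in this post, so the two methods below are assumptions for illustration only:

```java
import java.util.ArrayList;
import java.util.List;

// Assumed minimal subset of the TurtlePainter interface used by LogoDriver.
interface TurtlePainter {
    void forward(int points);
    void right(int degrees);
}

// Records every command it receives, so a test can assert on the sequence.
class RecordingPainter implements TurtlePainter {
    final List<String> commands = new ArrayList<>();

    @Override
    public void forward(int points) {
        commands.add("forward " + points);
    }

    @Override
    public void right(int degrees) {
        commands.add("right " + degrees);
    }
}

public class RecordingPainterDemo {
    public static void main(String[] args) {
        RecordingPainter painter = new RecordingPainter();
        painter.forward(50);
        painter.right(90);
        System.out.println(painter.commands); // prints [forward 50, right 90]
    }
}
```

Passing such a painter to LogoDriver would let a test assert the exact command sequence a Logo program produces, without any UI involved.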

The JavaFX UI

In our case, the TurtlePainter is a class that translates the commands into JavaFX constructs and delegates to the UI thread to draw those constructs. For example, the implementation of the forward command looks like:

    @Override
    public void forward(int points) {
        JavaFXThreadHelper.runOrDefer(() -> {
            final double radian = this.toRadian(this.direction);
            final double x = this.turtle.getCenterX() + points * Math.cos(radian);
            final double y = this.turtle.getCenterY() - points * Math.sin(radian);

            this.validateBounds(x, y);

            this.moveTurtle(x, y);
        });
    }

    private void moveTurtle(final double x, final double y) {
        JavaFXThreadHelper.runOrDefer(() -> {

            final Path path = new Path();
            path.getElements().add(new MoveTo(this.turtle.getCenterX(), this.turtle.getCenterY()));
            path.getElements().add(new LineTo(x, y));

            final PathTransition pathTransition = new PathTransition();
            pathTransition.setDuration(Duration.millis(this.animationDurationMs));
            pathTransition.setPath(path);
            pathTransition.setNode(this.turtle);

            if (this.isPenDown) {
                final Line line = new Line(this.turtle.getCenterX(), this.turtle.getCenterY(), x, y);
                pathTransition.setOnFinished(onFinished -> this.canvas.getChildren().add(line));
            }

            animation.getChildren().add(pathTransition);

            this.paintTurtle(x, y);
        });
    }

Effectively, this draws a line on the UI.
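As a quick sanity check of the trigonometry above (note the minus sign: screen y coordinates grow downwards), the coordinate update can be reproduced standalone. The ForwardMath class name is just for this sketch:

```java
public class ForwardMath {
    // Mirrors the coordinate update in forward(): screen y grows downwards,
    // so moving "up" (direction 90 degrees) subtracts from y.
    static double[] move(double cx, double cy, int points, double directionDegrees) {
        double radian = Math.toRadians(directionDegrees);
        return new double[] {
            cx + points * Math.cos(radian),
            cy - points * Math.sin(radian)
        };
    }

    public static void main(String[] args) {
        double[] p = move(100, 100, 50, 90);   // turtle facing "up"
        System.out.printf("x=%.1f y=%.1f%n", p[0], p[1]); // prints x=100.0 y=50.0
    }
}
```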

A simple Logo program that draws "HELLO WORLD" on the screen can be found here. The result for this one would look like:

The Source Code

Source code is checked into github.

Quite a few enhancements can be made, both on the UI side, but also at the language level side:

  • Implement Logo flow control (i.e. loops)
  • Make the turtle an actual turtle image, also showing its facing direction
  • etc…

Feel free to fork or send a PR for any addition :)

Brushing Up My C. Building A Unix Domain Socket Client/Server (PART II)

I described in this previous blog post how to build a simplistic Unix Domain Socket client/server application.
The disadvantage of that approach is that the server can only handle one connection at a time (i.e. it is not concurrent).

This blog post explains how this can be improved by using mechanisms like select(), epoll(), kqueue() etc.
Effectively, all these mechanisms allow for monitoring multiple file descriptors and being notified when one or more of them have data, so that an action can be taken (i.e. read, write etc.).
The main differences among those are characteristics like:

  • Synchronous vs Asynchronous paradigms
  • Underlying data structures in the internals of those system calls, which play an important role on performance
  • Platform/OS support, as not every OS supports all of the above. Some are platform-agnostic (i.e. select()), some are platform-specific (i.e. epoll() is only implemented on Linux)

A superb blog post for understanding the differences is this one by Julia Evans.

select()

I tried to just enhance the server part of the previous blog post and I went with the select() option.
The select() system call is simpler than epoll() or kqueue(); it effectively allows for registering a number of file descriptors to be monitored for I/O events. On calling select() the thread blocks, and it only unblocks when one or more file descriptors have I/O data.
The file descriptors have to be manually registered in an fd_set, which in turn is passed to the select() call. The below macros can be used to manipulate the fd_set:

  • void FD_ZERO(fd_set *set): Initialize an fd_set
  • void FD_SET(int fd, fd_set *set): Add a file descriptor to an fd_set
  • void FD_CLR(int fd, fd_set *set): Remove a file descriptor from the fd_set
  • int FD_ISSET(int fd, fd_set *set): Check whether a specific file descriptor in the fd_set is ready with I/O data

The main caveat with select() is that on every call the fd_set is cleared of the file descriptors that do not have any I/O data in that cycle; hence the developer has to manually re-register all the file descriptors again, as is also described in the select() documentation:

Note well: Upon return, each of the file descriptor sets is modified in place to indicate which file descriptors are currently "ready". Thus, if using select() within a loop, the sets must be reinitialized before each call.

Having said that, the server.c file now looks like:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stddef.h>

#include <unistd.h>
#include <sys/socket.h>
#include <sys/types.h> 
#include <sys/un.h>
#include <netinet/in.h>
#include <sys/syscall.h>
#include <sys/select.h>
#include <errno.h>

#include "af_unix_sockets_common.h"

int pumpData(int fd);
int cleanupConnections(int *connections, int idx);

/*
* Open an `AF_UNIX` socket on the `path` specified. `bind()` to that address, `listen()` for incoming connections and `accept()` them.
* Use `select()` to multiplex across the listening socket and all open connections, printing any data received to `stdout`.
*/
void server(char *path) {
    printf("Starting AF_UNIX server on Path=%s\n", path);
    AFUnixAddress *domainSocketAddress = open_af_unix_socket(path);

    int hasBind = bind(domainSocketAddress->fd, (struct sockaddr *)domainSocketAddress->address, sizeof(struct sockaddr_un));
    if(hasBind == -1){
        fprintf(stderr, "Failed to bind AF_UNIX socket on Path=%s. ErrorNo=%d\n", path, errno);
        cleanup(domainSocketAddress->fd, path);
        exit(errno);
    }

    int isListening = listen(domainSocketAddress->fd,  10);
    if(isListening == -1) {
        fprintf(stderr, "Failed to listen to AF_UNIX socket on Path=%s. ErrorNo=%d\n", path, errno);
        cleanup(domainSocketAddress->fd, path);
        exit(errno);
    }

    fd_set readfds;
    int maxFD = domainSocketAddress->fd;
    FD_ZERO(&readfds);
    FD_SET(domainSocketAddress->fd, &readfds);

    int openConnections[FD_SETSIZE];// file descriptors of currently open connections
    int nextIdx = 0;

    fprintf(stdout, "Start accepting connections on Path=%s\n", path);
    while(TRUE) {
        int retVal = select(maxFD + 1, &readfds, NULL, NULL, NULL);
        if(retVal == -1) {
            fprintf(stderr, "select() failed. Error=%s, ErrorNo=%d\n", strerror(errno), errno);
            cleanup(domainSocketAddress->fd, path);
            exit(errno);
        }
        if(FD_ISSET(domainSocketAddress->fd, &readfds)) {

            int connFd = accept(domainSocketAddress->fd, NULL, NULL);
            if(connFd == -1) {
                fprintf(stderr, "Error while accepting connection. Error=%s, ErrorNo=%d\n", strerror(errno), errno);
                cleanup(domainSocketAddress->fd, path);
                exit(errno);
            }
            fprintf(stdout, "New AF_UNIX connection added\n");

            openConnections[nextIdx++] = connFd;
            maxFD = maxFD >= connFd ? maxFD : connFd;
            FD_SET(connFd, &readfds);
        } else {
            for(int i = 0; i < nextIdx;i ++) {
                if(FD_ISSET(openConnections[i], &readfds)) {

                    if(!pumpData(openConnections[i])){
                        FD_CLR(openConnections[i], &readfds);
                        openConnections[i] = -1;// denotes that connection has closed
                    }
                }
            }

            nextIdx = cleanupConnections(openConnections, nextIdx);
        }

        // re-add all active FDs to fd_set
        FD_SET(domainSocketAddress->fd, &readfds);
        for(int i = 0; i < nextIdx;i ++) {
            FD_SET(openConnections[i], &readfds);
        }
    }

    cleanup(domainSocketAddress->fd, path);
}

int pumpData(int connFd) {
    char buf[BUFSIZ];
    int bytes = read(connFd, buf, BUFSIZ);
    if(bytes <= 0) {
        fprintf(stdout, "Connection closed\n");
        return FALSE;
    }
    write(1, buf, bytes);
    return TRUE;
}

int cleanupConnections(int *connections, int idx) {
    int temp[idx];
    int next = 0;
    for(int i = 0; i < idx;i++) {
        if(connections[i] != -1) {
            temp[next++] = connections[i];
        }
    }

    memcpy(connections, temp, sizeof(temp));
    return next;
}

The Changes

The few changes worth mentioning are the following:

  • We first registered the AF_UNIX sockets file descriptor on the fd_set that is passed into the select() call.
  • On every call to select(), the first check is whether the listening socket's file descriptor has I/O data, which means a new connection. If so, the server accept()s that connection:
if(FD_ISSET(domainSocketAddress->fd, &readfds)) {
    int connFd = accept(domainSocketAddress->fd, NULL, NULL);
    ...
  • After accepting a connection, that connection's file descriptor has to be added to the fd_set so that it can be monitored for I/O events
FD_SET(connFd, &readfds);
  • For every open connection, the program checks whether the corresponding file descriptor has I/O data and, if so, reads that data. It is worth noting that a connection closing also raises an I/O signal, hence the program needs to detect this and remove the closed file descriptor from the monitored fd_set:
if(FD_ISSET(openConnections[i], &readfds)) {
    if(!pumpData(openConnections[i])){
        FD_CLR(openConnections[i], &readfds);
        openConnections[i] = -1;// denotes that connection has closed
    }
}
  • Finally, as mentioned above, after select() returns, the fd_set only contains the file descriptors that had data. Any previously added file descriptors that did not have I/O data in that cycle are removed, and hence need to be re-added. Luckily, according to the select() documentation, there is no harm in trying to re-set a file descriptor that is already in the fd_set, so we just loop over the known file descriptors and re-add them all:

FD_SET() This macro adds the file descriptor fd to set. Adding a file descriptor that is already present in the set is a no-op, and does not produce an error.

// re-add all active FDs to fd_set
FD_SET(domainSocketAddress->fd, &readfds);
for(int i = 0; i < nextIdx;i ++) {
    FD_SET(openConnections[i], &readfds);
}

Conclusion

The changes needed to allow for multiplexing of different connections were minimal and did not radically affect the program's logic. One can take this example and enhance it further. Some suggestions would be:

  • Try epoll() instead of select()
  • Instead of just reading what the client has sent and printing it out to the console, broadcast the message to all clients connected at that time
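For the broadcast suggestion, a rough sketch under the same openConnections/nextIdx bookkeeping as server.c above (broadcast is a hypothetical helper, not part of the original code):

```c
#include <unistd.h>

/* Hypothetical helper for the broadcast idea: forward the bytes one client
 * sent to every other open connection. openConnections and nextIdx are the
 * same bookkeeping used by server.c above; -1 entries mark closed slots. */
void broadcast(int senderFd, const char *buf, int bytes,
               const int *openConnections, int nextIdx) {
    for (int i = 0; i < nextIdx; i++) {
        if (openConnections[i] != senderFd && openConnections[i] != -1) {
            write(openConnections[i], buf, bytes);
        }
    }
}
```

pumpData() would then call broadcast() with the bytes it just read, instead of only writing them to stdout.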

Brushing Up My C. Building A Unix Domain Socket Client/Server (Part I)

I haven't really done much C in my professional career. There were a couple of times I had to look into some C code, but I never really built any kind of system in C; most of my interaction with the C programming language dates back to university times.
I always liked C though: for its simplicity, for its small surface as a language (i.e. not bloated with tons of features) and, finally, for the perfect, almost artistic abstraction it provides on top of the hardware, staying close to the metal while hiding the complexities and providing just about the right amount of abstraction.

For all the above, and thanks to the lockdown and the plenty of me-time I now have, I decided to revise C and try to put some of its features into practice by gradually building a very simple application.

Revising Using a Book

I didn't want to follow tutorials or random knowledge from Google, hence I picked up a book for the revision. Even though there are a couple of great C books out there, for me it was a no-brainer to revisit one of the best books in my library: The C Programming Language by Brian W. Kernighan and Dennis M. Ritchie.
This book was written back in 1988, but I personally find it to be one of the best technical books I have ever read. I read the entire book in about two weeks, and in parallel I tried to do some of the plentiful exercises in each chapter. By doing so, I refreshed some of my knowledge of C and got a fresh overview of its features as a language.

Deciding On A Simple Application

After finishing the book I decided it would be nice to put some of the concepts into action and build something small using the features I had just revised. I wanted something that wouldn't take much time (it would mainly be a playground rather than a side project) but would still teach me something.
I was always interested in, and fairly familiar with, network servers. I have an understanding of the underlying kernel-space functions involved in socket programming, but I had always used abstractions on top of them, mainly through Java frameworks. So I decided to build something around that concept, and as I wanted to keep it fairly simple, and because I had never really used this feature before, I concluded that building a client/server application using Unix Domain Sockets made sense and would most likely fulfill most of my requirements.

Simplifying a lot, a Unix Domain Socket is like any other socket (i.e. an internet socket), but it can only be used for inter-process communication on the host on which it is opened, as it is actually backed by the filesystem.

Rules Of Engagement

In order to actually build this all by myself (I could just google an example and get the solution in 5 minutes), I decided to impose some very basic rules:

  • I could use the book I read as a reference
  • I could use man7.org as a reference for system calls
  • I could not use any other internet resource that provided a ready baked solution
  • The purpose of this application was not to be cross-platform. I was mainly interested in making it work on my Mac, and effectively on BSD-like systems

The Code

I decided to split the code into different source (.c) files:

  • af_unix_sockets.c: The file containing the main() function and the basic logic for parsing the command-line arguments in order to start a client or a server
  • af_unix_sockets_common.h: A header file containing common definitions, the prototypes of the functions the client and server implement, and the definition of a simple type, AFUnixAddress, storing a file descriptor and the actual socket address
  • af_unix_sockets_common.c: A source file containing some common functions
  • af_unix_sockets_server.c: The server implementation, called by the main function in af_unix_sockets.c
  • af_unix_sockets_client.c: The client implementation, called by the main function in af_unix_sockets.c

The Header File

As described above, af_unix_sockets_common.h is a header file defining the prototypes of various functions (which I view as the public interface) to be implemented by parts of the system and called by other parts.
Additionally, the header defines a type, mainly created for Part II of this post, encapsulating the file descriptor of an opened Unix domain socket along with its address.

#include <sys/un.h>

#define TRUE 1
#define FALSE 0
#define CLIENT "client"
#define SERVER "server"

typedef struct af_unix_address {
    int fd;
    struct sockaddr_un *address;
} AFUnixAddress;

AFUnixAddress * open_af_unix_socket(char *);
void cleanup(int fd, char *path);

void server(char *path);

void client(char *path);

The Common File

I wanted to have a common file, just for the sake of it, to be able to export some common functionality shared between the server and the client.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <unistd.h>

#include <sys/socket.h>
#include <sys/un.h>
#include <errno.h>

#include "af_unix_sockets_common.h"

/*
 * @return AFUnixAddress type, containing the address and the file descriptor for the opened unix domain socket  
*/
AFUnixAddress *open_af_unix_socket(char *path) {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if(fd == -1) {
        fprintf(stderr, "Failed to open AF_UNIX socket. ErrorNo=%d\n", errno);
        exit(errno);
    }

    AFUnixAddress *af_unix_socket = malloc(sizeof(AFUnixAddress));
    af_unix_socket->address = malloc(sizeof(struct sockaddr_un));
    af_unix_socket->fd = fd;
    af_unix_socket->address->sun_family = AF_UNIX;
    /* Bounded copy: sun_path is a fixed-size buffer, so a long path would overflow strcpy */
    strncpy(af_unix_socket->address->sun_path, path, sizeof(af_unix_socket->address->sun_path) - 1);
    af_unix_socket->address->sun_path[sizeof(af_unix_socket->address->sun_path) - 1] = '\0';
    return af_unix_socket;
}

/*
 * Close the socket's file descriptor and remove the socket file at the given path
 */
void cleanup(int fd, char *path) {
    if(close(fd) == -1) {
        fprintf(stderr, "Failed to successfully close socket. ErrorNo=%d\n", errno);
    }

    remove(path);
}

The common file is very simple, containing just two functions. The first opens a Unix domain socket for the specified path and returns an AFUnixAddress, the type defined in af_unix_sockets_common.h, which holds the socket's address and the file descriptor corresponding to that socket.
The file descriptor is kept for later use, to invoke system calls on it. Finally, it is worth mentioning that the Unix domain socket is opened as a SOCK_STREAM one, which according to the documentation provides connection-oriented stream semantics (similar to TCP), as opposed to SOCK_DGRAM, which provides datagram semantics (similar to UDP).

The Client

The client's behavior is defined in its own file and is pretty simple. A path is passed in, specifying the filepath for the Unix domain socket. A socket is opened for that path, and a call to connect(), passing in the file descriptor associated with the socket, establishes the connection.
Finally, the client reads from stdin and writes each line to the socket using the write() system call.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <unistd.h>

#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <sys/syscall.h>
#include <errno.h>

#include "af_unix_sockets_common.h"

/*
 * AF_UNIX socket client: obtains an AFUnixAddress by opening the socket on the specified path and then invokes connect() on the socket's file descriptor.
 * Finally, the client reads input from stdin and writes it to the socket.
 */
void client(char *path) {
    fprintf(stdout, "Starting AF_UNIX client on Path=%s\n", path);
    AFUnixAddress *domainSocketAddress = open_af_unix_socket(path);
    printf("AF_UNIX client socket on Path=%s opened with fd=%d\n", domainSocketAddress->address->sun_path, domainSocketAddress->fd);

    int isConnected = connect(domainSocketAddress->fd, (struct sockaddr *)domainSocketAddress->address, sizeof(struct sockaddr_un));
    if(isConnected == -1) {
        fprintf(stderr, "Failed to connect to Path=%s. ErrorNo=%s\n", domainSocketAddress->address->sun_path, strerror(errno));
        cleanup(domainSocketAddress->fd, domainSocketAddress->address->sun_path);
        exit(errno);
    }

    char line[1024];
    while(fgets(line, sizeof(line), stdin) != NULL) {
        size_t size = strlen(line);
        if(size == 0) {
            break;
        }
        if(write(domainSocketAddress->fd, line, size) == -1) {
            fprintf(stderr, "Failed to write to socket. ErrorNo=%d\n", errno);
            break;
        }
    }

    cleanup(domainSocketAddress->fd, domainSocketAddress->address->sun_path);
    exit(0);
}

The Server

The server follows the same pattern as the client. The only difference is the system calls involved in binding to the opened socket and listening on it. More specifically, after the socket is created,
a call to bind() associates the file descriptor with the socket's address. Then a call to listen() marks the socket as one waiting
for incoming connections, and finally a call to accept() accepts the first enqueued connection request and returns a file descriptor for that connection. That file descriptor can
be passed to the read() system call to read incoming bytes. Note that we had to call accept() because we marked the domain socket as a SOCK_STREAM one,
hence effectively a connection-oriented socket.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <unistd.h>

#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <sys/syscall.h>
#include <errno.h>

#include "af_unix_sockets_common.h"

/*
 * Open an AF_UNIX socket on the specified path: bind() to that address, listen() for incoming connections and accept(). Finally, wait for input from the socket and print
 * it to stdout. When one connection is closed, wait for the next one.
 */
void server(char *path) {
    printf("Starting AF_UNIX server on Path=%s\n", path);
    AFUnixAddress *domainSocketAddress = open_af_unix_socket(path);

    int hasBind = bind(domainSocketAddress->fd, (struct sockaddr *)domainSocketAddress->address, sizeof(struct sockaddr_un));
    if(hasBind == -1){
        fprintf(stderr, "Failed to bind AF_UNIX socket on Path=%s. ErrorNo=%d\n", path, errno);
        cleanup(domainSocketAddress->fd, path);
        exit(errno);
    }

    int isListening = listen(domainSocketAddress->fd,  10);
    if(isListening == -1) {
        fprintf(stderr, "Failed to listen to AF_UNIX socket on Path=%s. ErrorNo=%d\n", path, errno);
        cleanup(domainSocketAddress->fd, path);
        exit(errno);
    }

    fprintf(stdout, "Start accepting connections on Path=%s\n", path);
    while(TRUE) {
        int connFd = accept(domainSocketAddress->fd, NULL, NULL);
        if(connFd == -1) {
            fprintf(stderr, "Error while accepting connection. Error=%s, ErrorNo=%d\n", strerror(errno), errno);
            cleanup(domainSocketAddress->fd, path);
            exit(errno);
        }

        char buf[BUFSIZ];
        while(TRUE){
            int bytes = read(connFd, buf, BUFSIZ);
            if(bytes <= 0) {
                fprintf(stdout, "Connection closed\n");
                break;
            }
            write(1, buf, bytes);
        }
    }

    cleanup(domainSocketAddress->fd, path);
}

The Main Method

The file that contains the program's main method reads the command line arguments, does some rudimentary parsing and, based on what was passed, starts an AF_UNIX socket server or client.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <unistd.h>

#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <sys/syscall.h>
#include <errno.h>

#include "af_unix_sockets_common.h"

const static char *USAGE = "Usage: af_unix_sockets --type [server|client] --path [path]\n";

int main(int argc, char **argv){
    if(argc != 5) {
        fprintf(stderr, "%s", USAGE);
        exit(-1);
    }

    char *type = NULL;
    char *path = NULL;
    char **nextParam = NULL;

    int idx = 1;
    int expectFlag = TRUE;
    while(idx < 5) {
        if(strncmp(argv[idx], "--", 2) == 0) {
            if(strcmp("--type", argv[idx]) == 0) {
                nextParam = &type;
            } else if(strcmp("--path", argv[idx]) == 0) {
                nextParam = &path;
            } else {
                fprintf(stderr, "%s", USAGE);
                exit(-1);
            }
            expectFlag = FALSE;
        } else {
            if(expectFlag || nextParam == NULL) {
                fprintf(stderr, "Expected flag for positional argument %d. %s", idx, USAGE);
                exit(-1);
            }
            /* Duplicate the full argument string, so it survives independently of argv */
            *nextParam = strdup(argv[idx]);
            expectFlag = TRUE;
        }
        ++idx;
    }

    if(type == NULL || path == NULL) {
        fprintf(stderr, "%s", USAGE);
        exit(-1);
    }

    fprintf(stdout, "Initializing AF_UNIX for Type=%s, Path=%s\n", type, path);
    if(strcmp(type, SERVER) == 0) {
        if(access(path, F_OK) != -1) {
            fprintf(stdout, "File=%s already exists. Deleting file to be used by AF_UNIX server\n", path);
            if(remove(path) != 0) {
                fprintf(stderr, "Failed to remove existing File=%s. File cannot be used for AF_UNIX server\n", path);
                exit(-1);
            }
        }

        server(path);
    } else if(strcmp(type, CLIENT) == 0) {
        client(path);
    } else {
        fprintf(stderr, "Unknown Type=%s\n", type);
    }
    return 0;
}

Compiling And Running The Program

Now that we have all the building blocks, we can actually compile, link and run the program. Depending on the operating system, a compatible compiler is needed; some standard choices are GCC or Clang.
I have personally used Clang, as it is the standard on a macOS system.

Compiling and linking the different files together can simply be done with:

clang af_unix_sockets_common.c af_unix_sockets.c af_unix_sockets_client.c af_unix_sockets_server.c -o af_unix_sockets

Running the server:

af_unix_sockets --type server --path /tmp/af_unix_socket_example

Running the client:

af_unix_sockets --type client --path /tmp/af_unix_socket_example

Why Part I

This post is named Part I. The main reason behind this is that this program only accepts one connection at a time. A more scalable way of writing it is to use a readiness-notification mechanism (i.e. select(), epoll(), kqueue()), which allows for connection multiplexing and concurrent handling of those connections.
This will be described in Part II of this blog, and hopefully it will not require many alterations to the code above.

Migrating Modules/Directories Between GIT Repos

Disclaimer: I am not very good at using Git. The vast majority of time I just need to use a tiny subset of its features, and most likely my usage of Git resembles how I was using SVN in the past. Hence this post is actually a collection of resources I found online, which I document here for my own future benefit.

Working in large projects often means you have to move code around between different repos. For example, in many cases developers start building an application maintaining all the code in one repo, but as the application matures, or as parts of it are extracted into 'framework'-like code, those parts need to move to a new repository. It is always beneficial to preserve the history of those modules/directories/files.

I came across such a situation recently and below are the steps I followed to achieve that.

Setup

For clarity, let's assume our setup is as below:

  • SourceRepo is the name of the repository where the code we want to migrate lives
  • Dir_I, Dir_II, Dir_III are the directories (including the files and other directories inside them) from SourceRepo that we need to migrate
  • TargetRepo is the new, existing repository into which we need to migrate the above 3 directories, including their history

Steps

  • Clone the SourceRepo
git clone git@github.com:username/SourceRepo.git
cd SourceRepo
  • Remove the remote origin

This is optional and mainly done as a precaution, so that nothing will be pushed to remote origin by accident

git remote rm origin
  • Filter the commits in all branches that relate to the directories we want to keep

This step will blow away every other directory in the SourceRepo, only keeping the above 3 directories, along with their history.

git filter-branch --index-filter 'git rm --cached --quiet -r --ignore-unmatch -- . && git reset $GIT_COMMIT -- Dir_I Dir_II Dir_III' --tag-name-filter cat -- --all

This step will result in the final form of SourceRepo.

  • Clone the TargetRepo
git clone git@github.com:username/TargetRepo.git
cd TargetRepo
  • Add the SourceRepo as a remote to the TargetRepo
git remote add localSourceRepo ../SourceRepo
  • Fetch the index of the newly added remote localSourceRepo
git fetch localSourceRepo
  • Create a branch out of localSourceRepo

This branch will effectively have all the directories along with their old Git history, that was kept from SourceRepo

git branch temp remotes/localSourceRepo/master
  • Create a branch out of TargetRepo master

This is an optional step, but will help us create an intermediary branch where we will merge changes from SourceRepo and TargetRepo. Once the merging and any conflict resolution is completed (if any), someone can raise a PR to merge this to master

git checkout -b fromMaster
  • Merge SourceRepo branch to TargetRepo branch

In some cases (i.e. you had an identical named dir in TargetRepo) conflicts will arise. Those conflicts can be resolved in this step.

git merge temp
  • Once happy with the merging, a PR can be raised to merge fromMaster branch to master
  • Cleanup

This is an optional step. Most likely someone will be deleting all those temporary clones and dirs that were created. But for the benefit of the reader, we no longer need the localSourceRepo remote that we added, nor the temp and fromMaster (assuming it was merged already) branches.

git remote rm localSourceRepo
git branch -D temp
git branch -D fromMaster

For Loops, Allocations and Escape Analysis

For Java applications in certain domains it is truly important that the creation of objects/garbage is kept to a minimum. Those applications usually cannot afford GC pauses, hence they use specific techniques
and methodologies to avoid any garbage creation. One of those techniques has to do with iterating over a collection or an array of items. The preferred way is to use the classic for loop. The enhanced-for loop
is avoided as 'it creates garbage', by using the collection's Iterator under the covers.

In order to prove this point I was playing around with loops, as I wanted to better understand the differences and measure the amount of garbage created by using the enhanced-for loop, which
arguably is a better, more intuitive syntax.

Prior on experimenting on this, I had (falsely?) made some assumptions:

  • Using a normal for loop over an array or a collection does not create any new allocations
  • Using an enhanced-for loop over an array(?) or a collection does allocate
  • Using an enhanced-for loop over an array or a collection of primitives, by accidentally autoboxing the primitive values, ends up in a pretty high rate of new object creation

In order to better understand the differences, and especially the fact that an array does not have an iterator (so how does the enhanced-for loop work there?), I followed the steps below.

Step 1: Enhanced-for Loop Under The Cover

An enhanced-for loop is just syntactic sugar, but what does it actually translate into when used on an array and when used on a collection of items?

The answer to this can be found in the Java Language Specification.

The main two points from the above link are:

If the type of Expression is a subtype of Iterable, then the translation is as follows.
If the type of Expression is a subtype of Iterable for some type argument X, then let I be the type java.util.Iterator; otherwise, let I be the raw type java.util.Iterator.
The enhanced for statement is equivalent to a basic for statement of the form:

for (I #i = Expression.iterator(); #i.hasNext(); ) {
    {VariableModifier} TargetType Identifier =
        (TargetType) #i.next();
    Statement
}

and

Otherwise, the Expression necessarily has an array type, T[].
Let L1 … Lm be the (possibly empty) sequence of labels immediately preceding the enhanced for statement.
The enhanced for statement is equivalent to a basic for statement of the form:

T[] #a = Expression;
L1: L2: ... Lm:
for (int #i = 0; #i < #a.length; #i++) {
    {VariableModifier} TargetType Identifier = #a[#i];
    Statement
}

From the above one can observe that the Iterator is indeed used by the enhanced-for loop on collections. However, for an array, the enhanced-for loop is just syntactic sugar equivalent to a normal for loop.

After understanding how the compiler actually translates the enhanced-for loop in the different cases, our assumptions have changed:

  • Using a normal for loop over an array or a collection does NOT create any new allocations
  • Using an enhanced-for loop over an array does NOT create any new allocations
  • Using an enhanced-for loop over a collection does allocate
  • Using an enhanced-for loop over an array or a collection of primitives, by accidentally autoboxing the primitive values, ends up in a pretty high rate of new object creation
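Concretely, the two translations from the JLS look like this on real types (my own illustration, hypothetical class and method names):

```java
import java.util.Iterator;
import java.util.List;

public class DesugarDemo {

    // enhanced-for over a collection desugars to an explicit Iterator loop
    static String joinWords() {
        List<String> words = List.of("a", "b", "c");
        StringBuilder out = new StringBuilder();
        for (Iterator<String> it = words.iterator(); it.hasNext(); ) {
            out.append(it.next());
        }
        return out.toString();
    }

    // enhanced-for over an array desugars to a plain indexed loop
    static int sumArray() {
        int[] nums = {1, 2, 3};
        int sum = 0;
        for (int i = 0; i < nums.length; i++) {
            sum += nums[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(joinWords() + ":" + sumArray()); // prints "abc:6"
    }
}
```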

Step 2: Defining The Test

In order to test the different scenarios I have created a very simple test which can be seen here

The test itself is very simple, the main points to notice are:

  • The test creates a static array and a static ArrayList and prepopulates them with 100,000 integers. In the case of the array those are primitives, but in the case of the collection, as we use a plain ArrayList, those are actually Integer objects
  • The test executes the different for loop example scenarios 1,000,000 times
  • The memory used is read before the iterations start and is compared throughout the execution (every 100 invocations) of the program in order to determine if the memory profile has changed
  • The test scenarios include:
    • A for loop over an array
    • An enhanced-for loop over an array
    • An enhanced-for loop over an array, by also autoboxing the elements
    • A for loop over a collection
    • An enhanced-for loop over a collection
    • An iterator based for loop over a collection, replicating the behaviour of enhanced-for loop's syntactic sugar
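A simplified, hypothetical sketch of such a test (the real one linked above runs far more iterations and samples memory every 100 invocations) could look like this; the class and method names are my own:

```java
import java.util.ArrayList;
import java.util.List;

public class LoopAllocationSketch {

    static final int[] ARRAY = new int[100_000];
    static final List<Integer> LIST = new ArrayList<>();

    static {
        for (int i = 0; i < 100_000; i++) {
            ARRAY[i] = i;
            LIST.add(i);
        }
    }

    // Sample the currently used heap
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Classic for loop over a primitive array: no allocations expected
    static long forLoopOverArray() {
        long sum = 0;
        for (int i = 0; i < ARRAY.length; i++) {
            sum += ARRAY[i];
        }
        return sum;
    }

    // Enhanced-for over a List: desugars to LIST.iterator(), i.e. an Itr per pass
    static long enhancedForOverList() {
        long sum = 0;
        for (Integer i : LIST) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        long before = usedHeap();
        long sum = 0;
        for (int run = 0; run < 1_000; run++) {
            sum += forLoopOverArray();
            sum += enhancedForOverList();
        }
        // The sum is used, so the JIT cannot eliminate the loops as dead code
        System.out.println("sum=" + sum + " usedHeapDelta=" + (usedHeap() - before));
    }
}
```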

Step 3: Running The Test

We ran the test with the below setup:

  • OS: MacOS Catalina (10.15.3), Core i5 @2.6GHz, 8GB DDR3
  • JDK: openjdk version "13.0.2" 2020-01-14
  • JVM_OPTS: -Xms512M -Xmx512M -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC

We use EpsilonGC in order to eliminate any garbage collection and simply let memory usage grow.

When running the test, some scenarios were easy to verify according to our expectations:

  • A for loop over an array or a collection, does not create any allocations
  • An enhanced for loop over an array does not create any allocations
  • An enhanced for loop over an array with autoboxing, it is indeed creating new objects

However, the rest of the scenarios, and in particular the assumption that an enhanced-for loop over a collection allocates a new Iterator on every loop, could not be proved by running the above test with the above JVM properties. No matter what,
the memory profile was steady. No new allocations were taking place on the heap.

The first step of the investigation was to make sure that the bytecode indicates that a new object gets created. Below is the bytecode, which verifies that a call to obtain the iterator takes place at offset 5:

  private static long forEachLoopListIterator();
    Code:
       0: lconst_0
       1: lstore_0
       2: getstatic     #5                  // Field LIST_VALUES:Ljava/util/List;
       5: invokeinterface #9,  1            // InterfaceMethod java/util/List.iterator:()Ljava/util/Iterator;
      10: astore_2
      11: aload_2
      12: invokeinterface #10,  1           // InterfaceMethod java/util/Iterator.hasNext:()Z
      17: ifeq          39
      20: lload_0
      21: aload_2
      22: invokeinterface #11,  1           // InterfaceMethod java/util/Iterator.next:()Ljava/lang/Object;
      27: checkcast     #8                  // class java/lang/Integer
      30: invokevirtual #4                  // Method java/lang/Integer.intValue:()I
      33: i2l
      34: ladd
      35: lstore_0
      36: goto          11
      39: lload_0
      40: lreturn

As we are using an ArrayList, the next step is to see what the call to #iterator() does. It does indeed create a new iterator object, as can be seen in the ArrayList source code

    public Iterator<E> iterator() {
        return new Itr();
    }
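A quick way to convince ourselves that every call hands back a fresh iterator object (a standalone illustration, not part of the benchmark):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class IteratorIdentity {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>(List.of(1, 2, 3));
        // Each call to iterator() allocates a new ArrayList.Itr instance
        Iterator<Integer> first = list.iterator();
        Iterator<Integer> second = list.iterator();
        System.out.println(first != second); // prints "true"
    }
}
```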

Looking at the above, the results we are getting, with a steady memory profile, do not make much sense. Something else is definitely going on. It might be that the test is wrong (i.e. some code is removed by the JIT because the returned value of that block is never used).
This should not be happening, as the returned value of all the methods that exercise the loops is used to take a decision further down in the program, hence the loops must be executed.

My final thought was the 'unlikely' scenario that the objects were being placed on the stack. It is known that Hotspot performs this kind of optimization, using the output of Escape Analysis.
To be honest, I had never seen it happening (or at least I never had the time to verify it was indeed happening) until now.

Step 4: Running Without Escape Analysis

The easiest and fastest way to verify the above assumption, that Escape Analysis was feeding into the JIT and causing the objects to be allocated on the stack, is to turn off Escape Analysis. This can be done by adding -XX:-DoEscapeAnalysis to our JVM options.

Indeed, by running the same test again, this time we can see that the memory profile for an enhanced-for loop over a collection is steadily increasing. The Iterator objects created by ArrayList#iterator() are being allocated on the heap on each loop.

Conclusion

At least for myself, the above finding was kind of interesting. On many occasions, mainly because of lack of time, we just make assumptions and empirically follow practices that are "known to be working". Especially for people working in a delivery-oriented environment, without the luxury of performing research, I would think this is normal. It is interesting, though, to actually do some research from time to time and try to prove or better understand a point.

Finally, it is worth saying that the above behaviour was observed in an experiment rather than in actual code. I would imagine the majority of cases in a production system would not exhibit this behaviour (i.e. allocating on the stack), but the
fact that the JIT is such a sophisticated piece of software is very encouraging, as it can proactively optimize code without us realizing the extra gains.

Java, The Cost of a Single Element Loop

In quite a few cases I have seen myself designing code with listeners and callbacks. It is quite common for a class that emits events, to expose an API to attach listener(s) to it. Those listeners are usually stored in a data structure (see List, Set, Array) and when an event is about to be dispatched the listeners are iterated in a loop and the appropriate callback is called.

Something along the lines of:

public class EventDispatcher {

    private final List<Listener> listeners = new ArrayList<>();

    public void dispatchEvent() {
        final MyEvent event = new MyEvent();
        for (Listener listener : this.listeners) {
            listener.onEvent(event);
        }
    }

    public void attachListener(final Listener listener) {
        this.listeners.add(listener);
    }

    public void removeListener(final Listener listener) {
        this.listeners.remove(listener);
    }
}

public static class Listener implements EventListener {

    void onEvent(final MyEvent myEvent) {
        // do stuff
    }
}

public static class MyEvent {

}

In many cases I have observed that, despite the fact that the class is designed to accept many listeners, the truth is that just one listener is attached in the majority of cases.

Hence I wanted to measure the performance penalty paid when a class designed for many listeners has just one attached, versus a class designed from the start to accept just one listener.

In essence I wanted to check the performance impact on the below two cases.

private Listener listener;
private final List<Listener> singleElementArray = new ArrayList<Listener>() {
    {add(new Listener());}
};

public void dispatch() {
    this.listener.onEvent(new MyEvent());
}

public void dispatchInLoop() {
    for (int i = 0; i < 1; i++) {
        this.singleElementArray.get(i).onEvent(new MyEvent());
    }
}

Assumptions Made Prior To Testing

Before creating a benchmark for the above, I made some assumptions:

  • I assumed the single element (single listener in a data container) loop would be unrolled
  • I (wrongly) assumed that the performance cost would not be significant, as with the loop unrolled I would expect the native code produced to look more or less the same

JMH Benchmark

In order to test my assumptions I created the below benchmark:

SingleElementLoopBenchmark.java

Initial Observations

To my surprise I found out that an invocation through the single element list loop was about ~2.5x slower, based on the below throughput numbers:

Benchmark                                                    Mode  Cnt  Score   Error  Units
SingleElementLoopBenchmark.directInvocation                 thrpt   10  0.317 ± 0.022  ops/ns
SingleElementLoopBenchmark.singleElementListLoopInvocation  thrpt   10  0.114 ± 0.010  ops/ns

I couldn't really understand why, and the above seemed a bit too far from my expectations/assumptions.

The first thing I verified, with the JVM argument -XX:+PrintCompilation, was that both methods were compiled by the C2 compiler, which was the case.

I also tried to print the assembly code with -XX:+PrintAssembly but I couldn't really read/interpret the assembly code.

Resorting to Social Media

I ended up posting a tweet about my findings, asking for pointers on where/how to look for explanations of what I was observing. The answer I got was to try to find the hot methods using something like perfasm, which ties the assembly output to the hottest methods of the benchmark.

Which I did with -prof dtraceasm (the benchmark was running on a Mac, which is why I used dtrace). The output was as below:

Direct Invocation

9.56%  ↗  0x000000010b73c950: mov    0x40(%rsp),%r10
  1.00%  │  0x000000010b73c955: mov    0xc(%r10),%r10d                ;*getfield dispatcher {reexecute=0 rethrow=0 return_oop=0}
         │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::directInvocation@1 (line 23)
         │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_directInvocation_jmhTest::directInvocation_thrpt_jmhStub@17 (line 121)
  0.17%  │  0x000000010b73c959: mov    0xc(%r12,%r10,8),%r11d         ; implicit exception: dispatches to 0x000000010b73ca12
 11.18%  │  0x000000010b73c95e: test   %r11d,%r11d
  0.00%  │  0x000000010b73c961: je     0x000000010b73c9c9             ;*invokevirtual performAction {reexecute=0 rethrow=0 return_oop=0}
         │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invoke@5 (line 40)
         │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::directInvocation@5 (line 23)
         │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_directInvocation_jmhTest::directInvocation_thrpt_jmhStub@17 (line 121)
 10.69%  │  0x000000010b73c963: mov    %r9,(%rsp)
  0.65%  │  0x000000010b73c967: mov    0x38(%rsp),%rsi
  0.00%  │  0x000000010b73c96c: mov    $0x1,%edx
  0.14%  │  0x000000010b73c971: xchg   %ax,%ax
 10.08%  │  0x000000010b73c973: callq  0x000000010b6c2900             ; ImmutableOopMap{[48]=Oop [56]=Oop [64]=Oop [0]=Oop }
         │                                                            ;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
         │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Listener::performAction@2 (line 53)
         │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invoke@5 (line 40)
         │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::directInvocation@5 (line 23)
         │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_directInvocation_jmhTest::directInvocation_thrpt_jmhStub@17 (line 121)
         │                                                            ;   {optimized virtual_call}
  1.44%  │  0x000000010b73c978: mov    (%rsp),%r9
  0.19%  │  0x000000010b73c97c: movzbl 0x94(%r9),%r8d                 ;*ifeq {reexecute=0 rethrow=0 return_oop=0}
         │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_directInvocation_jmhTest::directInvocation_thrpt_jmhStub@30 (line 123)
  9.77%  │  0x000000010b73c984: mov    0x108(%r15),%r10
  0.99%  │  0x000000010b73c98b: add    $0x1,%rbp                      ; ImmutableOopMap{r9=Oop [48]=Oop [56]=Oop [64]=Oop }
         │                                                            ;*ifeq {reexecute=1 rethrow=0 return_oop=0}
         │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_directInvocation_jmhTest::directInvocation_thrpt_jmhStub@30 (line 123)
  0.02%  │  0x000000010b73c98f: test   %eax,(%r10)                    ;   {poll}
  0.28%  │  0x000000010b73c992: test   %r8d,%r8d
  0.00%  ╰  0x000000010b73c995: je     0x000000010b73c950             ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}

Single Element Loop Invocation

         ╭    0x000000011153fa9d: jmp    0x000000011153fad6
  0.19%  │ ↗  0x000000011153fa9f: mov    0x58(%rsp),%r13
  3.55%  │ │  0x000000011153faa4: mov    (%rsp),%rcx
  0.09%  │ │  0x000000011153faa8: mov    0x60(%rsp),%rdx
  0.22%  │ │  0x000000011153faad: mov    0x50(%rsp),%r11
  0.17%  │ │  0x000000011153fab2: mov    0x8(%rsp),%rbx                 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
         │ │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@12 (line 44)
         │ │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
         │ │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  3.55%  │↗│  0x000000011153fab7: movzbl 0x94(%r11),%r8d                ;*goto {reexecute=0 rethrow=0 return_oop=0}
         │││                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@35 (line 44)
         │││                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
         │││                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  0.16%  │││  0x000000011153fabf: mov    0x108(%r15),%r10
  0.28%  │││  0x000000011153fac6: add    $0x1,%rbx                      ; ImmutableOopMap{r11=Oop rcx=Oop rdx=Oop r13=Oop }
         │││                                                            ;*ifeq {reexecute=1 rethrow=0 return_oop=0}
         │││                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@30 (line 123)
  0.19%  │││  0x000000011153faca: test   %eax,(%r10)                    ;   {poll}
  4.00%  │││  0x000000011153facd: test   %r8d,%r8d
         │││  0x000000011153fad0: jne    0x000000011153fbe9             ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
         │││                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@33 (line 124)
  0.07%  ↘││  0x000000011153fad6: mov    0xc(%rcx),%r8d                 ;*getfield dispatcher {reexecute=0 rethrow=0 return_oop=0}
          ││                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@1 (line 28)
          ││                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  0.22%   ││  0x000000011153fada: mov    0x10(%r12,%r8,8),%r10d         ;*getfield singleListenerList {reexecute=0 rethrow=0 return_oop=0}
          ││                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@4 (line 44)
          ││                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
          ││                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
          ││                                                            ; implicit exception: dispatches to 0x000000011153ff2a
  0.21%   ││  0x000000011153fadf: mov    0x8(%r12,%r10,8),%edi          ; implicit exception: dispatches to 0x000000011153ff3e
  4.39%   ││  0x000000011153fae4: cmp    $0x237565,%edi                 ;   {metadata('com/nikoskatsanos/benchmarks/loops/SingleElementLoopBenchmark$Dispatcher$1')}
          ││  0x000000011153faea: jne    0x000000011153fc92
  0.33%   ││  0x000000011153faf0: lea    (%r12,%r10,8),%r9              ;*invokeinterface size {reexecute=0 rethrow=0 return_oop=0}
          ││                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@7 (line 44)
          ││                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
          ││                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  0.09%   ││  0x000000011153faf4: mov    0x10(%r9),%r9d
  0.14%   ││  0x000000011153faf8: test   %r9d,%r9d
          ╰│  0x000000011153fafb: jle    0x000000011153fab7             ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@12 (line 44)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
           │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  3.98%    │  0x000000011153fafd: lea    (%r12,%r8,8),%rdi              ;*getfield dispatcher {reexecute=0 rethrow=0 return_oop=0}
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@1 (line 28)
           │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  0.06%    │  0x000000011153fb01: xor    %r9d,%r9d                      ;*aload_0 {reexecute=0 rethrow=0 return_oop=0}
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@15 (line 45)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
           │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  0.09%    │  0x000000011153fb04: mov    0x8(%r12,%r10,8),%esi          ; implicit exception: dispatches to 0x000000011153ff4e
  0.06%    │  0x000000011153fb09: cmp    $0x237565,%esi                 ;   {metadata('com/nikoskatsanos/benchmarks/loops/SingleElementLoopBenchmark$Dispatcher$1')}
  0.00%    │  0x000000011153fb0f: jne    0x000000011153fcc2
  3.93%    │  0x000000011153fb15: lea    (%r12,%r10,8),%rax             ;*invokeinterface get {reexecute=0 rethrow=0 return_oop=0}
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@20 (line 45)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
           │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  0.06%    │  0x000000011153fb19: mov    0x10(%rax),%r10d               ;*getfield size {reexecute=0 rethrow=0 return_oop=0}
           │                                                            ; - java.util.ArrayList::get@2 (line 458)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@20 (line 45)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
           │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  0.11%    │  0x000000011153fb1d: test   %r10d,%r10d
           │  0x000000011153fb20: jl     0x000000011153fcf6             ;*invokestatic checkIndex {reexecute=0 rethrow=0 return_oop=0}
           │                                                            ; - java.util.Objects::checkIndex@3 (line 372)
           │                                                            ; - java.util.ArrayList::get@5 (line 458)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@20 (line 45)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
           │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  0.28%    │  0x000000011153fb26: cmp    %r10d,%r9d
  0.00%    │  0x000000011153fb29: jae    0x000000011153fc1c
  3.97%    │  0x000000011153fb2f: mov    0x14(%rax),%r10d               ;*getfield elementData {reexecute=0 rethrow=0 return_oop=0}
           │                                                            ; - java.util.ArrayList::elementData@1 (line 442)
           │                                                            ; - java.util.ArrayList::get@11 (line 459)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@20 (line 45)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
           │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  0.05%    │  0x000000011153fb33: mov    %r9d,%ebp                      ;*invokestatic checkIndex {reexecute=0 rethrow=0 return_oop=0}
           │                                                            ; - java.util.Objects::checkIndex@3 (line 372)
           │                                                            ; - java.util.ArrayList::get@5 (line 458)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@20 (line 45)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
           │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  0.08%    │  0x000000011153fb36: mov    0xc(%r12,%r10,8),%esi          ; implicit exception: dispatches to 0x000000011153ff62
  1.27%    │  0x000000011153fb3b: cmp    %esi,%ebp
           │  0x000000011153fb3d: jae    0x000000011153fc5a
  3.94%    │  0x000000011153fb43: shl    $0x3,%r10
  0.05%    │  0x000000011153fb47: mov    0x10(%r10,%rbp,4),%r9d         ;*aaload {reexecute=0 rethrow=0 return_oop=0}
           │                                                            ; - java.util.ArrayList::elementData@5 (line 442)
           │                                                            ; - java.util.ArrayList::get@11 (line 459)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@20 (line 45)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
           │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  1.71%    │  0x000000011153fb4c: mov    0x8(%r12,%r9,8),%r10d          ; implicit exception: dispatches to 0x000000011153ff72
 17.85%    │  0x000000011153fb51: cmp    $0x237522,%r10d                ;   {metadata('com/nikoskatsanos/benchmarks/loops/SingleElementLoopBenchmark$Listener')}
  0.00%    │  0x000000011153fb58: jne    0x000000011153fef6             ;*checkcast {reexecute=0 rethrow=0 return_oop=0}
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@25 (line 45)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
           │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  3.79%    │  0x000000011153fb5e: mov    %rdi,0x18(%rsp)
  0.02%    │  0x000000011153fb63: mov    %r8d,0x10(%rsp)
  0.02%    │  0x000000011153fb68: mov    %rbx,0x8(%rsp)
  0.19%    │  0x000000011153fb6d: mov    %r11,0x50(%rsp)
  3.95%    │  0x000000011153fb72: mov    %rdx,0x60(%rsp)
  0.02%    │  0x000000011153fb77: mov    %rcx,(%rsp)
  0.03%    │  0x000000011153fb7b: mov    %r13,0x58(%rsp)
  0.36%    │  0x000000011153fb80: mov    %rdx,%rsi
  3.78%    │  0x000000011153fb83: mov    $0x1,%edx
  0.01%    │  0x000000011153fb88: vzeroupper
  4.05%    │  0x000000011153fb8b: callq  0x00000001114c2900             ; ImmutableOopMap{[80]=Oop [88]=Oop [96]=Oop [0]=Oop [16]=NarrowOop [24]=Oop }
           │                                                            ;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Listener::performAction@2 (line 53)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@29 (line 45)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
           │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
           │                                                            ;   {optimized virtual_call}
  0.98%    │  0x000000011153fb90: mov    0x10(%rsp),%r8d
  3.61%    │  0x000000011153fb95: mov    0x10(%r12,%r8,8),%r10d         ;*getfield singleListenerList {reexecute=0 rethrow=0 return_oop=0}
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@4 (line 44)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
           │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  0.24%    │  0x000000011153fb9a: mov    0x8(%r12,%r10,8),%r9d          ; implicit exception: dispatches to 0x000000011153ff9e
  0.74%    │  0x000000011153fb9f: inc    %ebp                           ;*iinc {reexecute=0 rethrow=0 return_oop=0}
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@32 (line 44)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
           │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  0.04%    │  0x000000011153fba1: cmp    $0x237565,%r9d                 ;   {metadata('com/nikoskatsanos/benchmarks/loops/SingleElementLoopBenchmark$Dispatcher$1')}
  0.00%    │  0x000000011153fba8: jne    0x000000011153fd36
  3.60%    │  0x000000011153fbae: lea    (%r12,%r10,8),%r11             ;*invokeinterface size {reexecute=0 rethrow=0 return_oop=0}
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@7 (line 44)
           │                                                            ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
           │                                                            ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)
  0.11%    │  0x000000011153fbb2: mov    0x10(%r11),%r9d
  0.35%    │  0x000000011153fbb6: cmp    %r9d,%ebp
           ╰  0x000000011153fbb9: jge    0x000000011153fa9f             ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                                        ; - com.benchmarks.loops.SingleElementLoopBenchmark$Dispatcher::invokeInLoop@12 (line 44)
                                                                        ; - com.benchmarks.loops.SingleElementLoopBenchmark::singleElementLoopInvocation@5 (line 28)
                                                                        ; - com.benchmarks.loops.generated.SingleElementLoopBenchmark_singleElementLoopInvocation_jmhTest::singleElementLoopInvocation_thrpt_jmhStub@17 (line 121)

As I said, I am not really able to read/interpret assembly code, but reading between the lines I could see that:

  • The loop was indeed unrolled
  • A penalty was paid to cast the item to the expected type (17.85% of the CPU instructions)
  • A penalty was paid to fetch the item from the list's underlying array

In order to get some advice from someone knowledgeable on this, I posted the below question on StackOverflow. The answer is pretty comprehensive, as the person who answered is one of the most prominent names in the JVM community.

StackOverflow: Java Method Direct Invocation vs Single Element Loop

Conclusion/Observations

In summary:

  • The loop was indeed unrolled, as expected and as seen from the assembly code
  • The main penalty paid is for fetching the element from the list and casting it to the expected type
  • Some cost is also because of checks performed on the data container itself (i.e. size)
  • In general the extra cost being paid is memory access cost, rather than CPU instruction cost

As seen in the SO answer, Andrei makes the point that invoking the object's method from inside the loop is not ~2.5 times slower, but rather 3 ns slower, if we look at it from the perspective of latency (ns/op) rather than throughput (ops/ns) — e.g. ~2 ns/op vs ~5 ns/op is 2.5x in relative terms, but only 3 ns in absolute terms. This is a valid point, but I am not sure I agree 100%, as in some applications, depending on their nature, that extra cost will actually translate to the ~2.5x.

Finally, I have added in the JMH Benchmark test, tests for different data container types:

  • Array
  • List
  • Set

Observing those numbers, and as expected, the array is faster than the rest. The array is typed, hence the casting cost is not paid. The array underlying an ArrayList is of type Object[], hence the need to cast each element to the list's element type.
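To make that last point concrete, below is a minimal sketch (the Listener interface here is a hypothetical stand-in for the benchmark's own listener type): a typed array needs no cast on access, whereas an ArrayList stores its elements in an Object[] due to type erasure, so every get() carries an implicit checkcast — the same checkcast visible in the assembly above.

```java
import java.util.ArrayList;
import java.util.List;

public class ContainerAccessSketch {

    // Hypothetical stand-in for the benchmark's listener type
    interface Listener {
        void performAction();
    }

    public static void main(String[] args) {
        Listener listener = () -> { };

        // Typed array: the JVM knows the element type is Listener,
        // so no cast is needed on access
        Listener[] array = new Listener[]{listener};
        array[0].performAction();

        // ArrayList: generics are erased, elements live in an Object[],
        // so every get() compiles to an aaload plus a checkcast to Listener
        List<Listener> list = new ArrayList<>();
        list.add(listener);
        list.get(0).performAction();
    }
}
```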

Managing JDKs in MacOS

With the increasing number of JDK builds and the more frequent release cadence, I found it hard to keep track of what I had installed on my MacOS and switch between versions on the fly.

Even in 2019 my preferred version of Java is 1.8, probably because it is the version I use at work. But depending on the occasion I find myself experimenting with newer features from later versions, or even from experimental builds:

  • JShell from Java 9 onwards
  • EpsilonGC
  • The use of var since Java 10
  • Value types in Project Valhalla builds
  • etc…

In addition, nowadays on my personal computer I mainly use Java builds from the AdoptOpenJDK project, but there are other builds which I have installed on my MacOS to try out.

Hence I spent some time putting together a few bash functions that give me a hand managing and switching between those versions.

For the complete bash script see here, but the highlights are:

  • List JDKs
  • List Different JDK builds
  • Set JDK to a specific version/vendor

Note that java_home=/usr/libexec/java_home
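For a flavour of what those functions look like, here is a minimal sketch (the function names are illustrative, not necessarily the ones in the linked script), built on macOS's standard /usr/libexec/java_home utility:

```shell
# Minimal sketch of JDK-switching helpers for macOS.
# Function names are illustrative; see the linked script for the real thing.
java_home=/usr/libexec/java_home

# List all installed JDKs (version, vendor and path)
listJDKs() {
    "${java_home}" -V 2>&1
}

# Point the current shell at a specific version, e.g. `setJDK 1.8` or `setJDK 11`
setJDK() {
    export JAVA_HOME="$("${java_home}" -v "$1")"
    export PATH="${JAVA_HOME}/bin:${PATH}"
    java -version
}
```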

Identifying a high-CPU Java Thread

High CPU utilization Java application

Every now and then we find ourselves in situations when a single Java process is consuming a high percentage of CPU.

After investigating and ruling out high CPU caused by continuous GC cycles or other pathological reasons, we find ourselves needing to identify the business logic that causes those CPU spikes. An easy way of doing so is to identify the thread(s) consuming most of the CPU and try to pinpoint the culprit.

There are a few utilities (e.g. top, htop) that let us see a process as a tree along with the threads that live inside that process' space. After identifying the thread's ID, it is pretty easy to translate the ID to its HEX value and identify the actual thread in the Java application (i.e. by taking a thread dump).

Example

As an example, the following Java program uses two application threads (the main thread and a thread created by the user). One thread spins forever generating random values, while the main thread occasionally reads those values.

https://github.com/nikkatsa/nk-playground/blob/master/nk-dummies/src/main/java/com/nikoskatsanos/spinningthread/SpinningThread.java

It is expected that this would be a high CPU utilization application (see above image).
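The gist of the linked program is roughly the following (a simplified sketch, not the exact source — class and variable names are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class SpinningThreadSketch {

    private static volatile double lastValue;

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "Spinner");
            t.setDaemon(true); // let the JVM exit when main finishes
            return t;
        });

        // The "Spinner" thread busy-spins forever, pegging one core
        executor.execute(() -> {
            while (true) {
                lastValue = ThreadLocalRandom.current().nextDouble();
            }
        });

        // The main thread occasionally reads the latest value
        for (int i = 0; i < 3; i++) {
            TimeUnit.MILLISECONDS.sleep(100);
            System.out.println("lastValue=" + lastValue);
        }
    }
}
```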

Find the Rogue Thread

After identifying the Java program's PID (e.g. with jps, or something like ps, top, htop), we can run htop against that process as below:

htop -p${PID}

A user can then view that isolated process along with its threads. htop usually shows user space threads by default; if not, it is easy to enable via the setup page (Setup -> Display Options) by selecting the appropriate option.

Then a user should see an image like the below.

That shows the application's PID along with its threads, reporting the metrics (CPU, memory etc) for each thread. From there one can easily identify that thread 12820 is consuming a great percentage of CPU, hence it should be our culprit.
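If htop is not at hand, a plain ps can produce a similar per-thread view (Linux procps syntax; the PID assignment below is for illustration only):

```shell
# Per-thread CPU view without htop (Linux procps `ps`).
# LWP is the thread ID in decimal -- the value we later convert to HEX.
PID=$$   # illustration: replace $$ (this shell) with the Java process's PID
ps -L -p "${PID}" -o pid,lwp,pcpu,comm
```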

Translating Thread's ID to HEX

The next step is to translate the thread's decimal ID (12820) to its HEX value, which is: 0x3214
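This is a plain decimal-to-hex conversion — `printf '%x\n' 12820` in a shell, or, staying in Java, a one-liner with Long.toHexString (class and method names here are illustrative):

```java
public class ThreadIdToHex {

    // Converts a decimal thread ID (as shown by htop) to the
    // hex form used by the `nid` field in a Java thread dump
    static String toNid(long threadId) {
        return "0x" + Long.toHexString(threadId);
    }

    public static void main(String[] args) {
        System.out.println(toNid(12820)); // prints 0x3214
    }
}
```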

Getting a thread dump

Knowing the thread's HEX value, the user can take a thread dump and easily locate the thread and its stack trace.

Full thread dump Java HotSpot(TM) Client VM (25.65-b01 mixed mode):

"Attach Listener" #8 daemon prio=9 os_prio=0 tid=0x64900800 nid=0x3340 waiting on condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"Spinner" #7 daemon prio=5 os_prio=0 tid=0x6442a800 nid=0x3214 runnable [0x6467d000]
   java.lang.Thread.State: RUNNABLE
        at java.util.concurrent.ThreadLocalRandom.nextDouble(ThreadLocalRandom.java:442)
        at com.nikoskatsanos.spinningthread.SpinningThread.spin(SpinningThread.java:16)
        at com.nikoskatsanos.spinningthread.SpinningThread$$Lambda$1/28014437.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - <0x659c2198> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"Service Thread" #6 daemon prio=9 os_prio=0 tid=0x76183c00 nid=0x3212 runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"C1 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x76180c00 nid=0x3211 waiting on condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x7617f000 nid=0x3210 runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x76162000 nid=0x320f in Object.wait() [0x64f9c000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x65806400> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
        - locked <0x65806400> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

   Locked ownable synchronizers:
        - None
"C1 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x76180c00 nid=0x3211 waiting on condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x7617f000 nid=0x3210 runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x76162000 nid=0x320f in Object.wait() [0x64f9c000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x65806400> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
        - locked <0x65806400> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

   Locked ownable synchronizers:
        - None

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x76160800 nid=0x320e in Object.wait() [0x64fec000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x65805ef8> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:502)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:157)
        - locked <0x65805ef8> (a java.lang.ref.Reference$Lock)

   Locked ownable synchronizers:
        - None

"main" #1 prio=5 os_prio=0 tid=0x76107400 nid=0x320c waiting on condition [0x762b1000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at java.lang.Thread.sleep(Thread.java:340)
        at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
        at com.nikoskatsanos.spinningthread.SpinningThread.main(SpinningThread.java:40)

   Locked ownable synchronizers:
        - None

"VM Thread" os_prio=0 tid=0x7615d400 nid=0x320d runnable

"VM Periodic Task Thread" os_prio=0 tid=0x76185c00 nid=0x3213 waiting on condition

JNI global references: 310

The nid value (nid=0x3214) should match the HEX value of the thread's decimal ID

As seen in the above case, it is obvious that the thread named 'Spinner' is the high CPU utilization thread we are looking for. From this point the user can investigate the application's logic and determine the root cause.