Fall 2024 • Daniel Shiffman

Machine Learning for Creative Computing

Week 08 Depth Estimation Playground

Ever since my first experiments with depth estimation a couple of weeks ago, I've been enthralled by its graphic qualities. It feels particularly magical to derive depth from a flat image, like enhancing a memory.

I wanted to be able to play with this more easily, so I built an online tool to apply it to any image uploaded or captured by the webcam:

Homescreen of a website with two buttons, each offering different ways of uploading an image.
The interface homescreen.
Visit tool online

After making the web app and sharing it with some friends, it became clear that I needed to add controls for the depth. Since these models do monocular depth estimation, they can only estimate relative depth, not absolute z values; recovering absolute depth is apparently a very complex problem because different lenses portray depth differently. With the controls in place, I tested the tool on some images:

3D rendition of a fruit shop.
3D Fruit shop in Rome.
Photo of a fruit shop.
Source image of the fruit shop.
3D rendition of a bust wearing a face mask.
3D bust with face mask.
Photo of a bust with a face mask.
Source image of the bust.
3D rendition of a street in Gwangju, Korea.
3D street in Gwangju, Korea. Source was a 360 photo!

These explorations opened up some interesting creative possibilities. I want to keep exploring 360 images, as well as images of locations, which seem to look the most realistic.

Furthermore, I want to extend the tool so that I can add multiple 3D images to the same "canvas" and compose them by moving them around and setting different depths. That way I can prototype how stitching them together might work.

Week 05 & 06 ML sketch: Footmark to Keypoint Estimation

To record the data for training a model based on my proposal, I built a local p5.js sketch that receives the pressure-sensitive mat data through websockets, visualizes it, and at the same time runs the ml5.js BlazePose model on the webcam, specifically tracking the lower-body keypoints in x, y and z.
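Roughly, the setup looked like this. This is a sketch rather than the exact code: the websocket URL and message format are placeholders, and it assumes the ml5 1.x bodyPose API.

// Rough sketch of the recording setup (visualization omitted).
let video, bodyPose, poses = [];
let matPixels = [];   // latest 16x16 pressure readings, normalized 0-1
let lastMatTime = 0;  // when the last mat frame arrived
let lastPoseTime = 0; // when the last BlazePose prediction arrived

function preload() {
  bodyPose = ml5.bodyPose("BlazePose"); // BlazePose also estimates z per keypoint
}

function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.hide();

  bodyPose.detectStart(video, (results) => {
    poses = results;
    lastPoseTime = millis();
  });

  // Placeholder URL: a local server relaying the mat readings over websockets
  const socket = new WebSocket("ws://localhost:8080");
  socket.onmessage = (event) => {
    matPixels = JSON.parse(event.data); // expects an array of 256 values
    lastMatTime = millis();
  };
}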

Web interface with 16x16 grayscale image on the left, webcam image on the right and a red button below that reads start.
The interface during data recording.

Upon hitting record, the sketch adds a sample to the ml5 neural network object every 200ms, but only if the last mat readings received through websockets and the last predicted BlazePose keypoints are in time sync.
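The per-sample check is essentially the following sketch. It assumes the variables from the setup above, an ml5.neuralNetwork instance nn, a recording flag toggled by the start button, that BlazePose's 3D estimates are exposed as keypoints3D, and that indices 23-32 cover the lower body; the 100ms sync window is an arbitrary choice.

// Called every 200ms, e.g. via setInterval(recordSample, 200), while recording.
function recordSample() {
  if (!recording || poses.length === 0 || matPixels.length !== 256) return;

  // Only add a sample when the latest mat frame and the latest pose
  // prediction arrived close together in time.
  if (Math.abs(lastMatTime - lastPoseTime) > 100) return;

  // BlazePose indices 23-32: hips, knees, ankles, heels, foot tips.
  const outputs = [];
  for (let i = 23; i <= 32; i++) {
    const kp = poses[0].keypoints3D[i];
    outputs.push(kp.x, kp.y, kp.z);
  }
  nn.addData(matPixels, outputs); // 256 inputs -> 30 outputs
}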

After recording, I saved the data to JSON and made a sketch to train the model. Since this was a regression task with convolutional layers, I was a bit at a loss as to how to structure the architecture, since the only CNN ml5 offers is the imageClassification task. I tested a few configurations based on LLM recommendations; this is the one that made the most sense to me given how I understand CNNs:


layers: [
    {
        type: 'conv2d',
        filters: 8,
        kernelSize: 3,
        activation: 'relu',
        inputShape: [16, 16, 1]
    },
    {
        type: 'maxPooling2d',
        poolSize: [2, 2]
    },
    {
        type: 'conv2d',
        filters: 16,
        kernelSize: 3,
        activation: 'relu'
    },
    {
        type: 'maxPooling2d',
        poolSize: [2, 2]
    },
    {
        type: 'flatten'
    },
    {
        type: 'dense',
        units: 32,
        activation: 'relu'
    },
    {
        type: 'dense',
        units: 30
    }
]

Input: a 256-element one-dimensional array of values between 0 and 1, representing the pressure level at each point on the mat.

Outputs: 30 separate values, each corresponding to one of the x, y, z coordinates of a lower-body pose keypoint.

However, all of the configurations I tested had the same issue: the loss graph immediately went to 0 after the first epoch:

Tensorflow.js debugging panel showing a loss graph that dips down to 0 after the first epoch.
The loss graph, clearly indicating an issue.
Tensorflow.js debugging panel showing a loss graph that dips down and wiggles around 0.04.
After tweaking the training parameters during class and adding ml5's data normalization function, even though the data is already normalized.

It seemed to me that there could be several issues:

  • Too few datapoints was causing it to overfit
  • ml5 was not handling a regression with convolutional layers
  • The training data was not correctly formatted
  • The layer structure was wrong

A few days later, I went back to export my data and run some predictions to see how it worked. When I tried to train again, after a few epochs I got this error originating in ml5's callcallback.js file:


Uncaught (in promise) TypeError: t is not a function

Right around then, Fabri told me he had faced similar issues and managed to get past them by training and running the model directly with TensorFlow.js. He shared his code and I worked from there to move my project over to TensorFlow.js:

Input: a tensor with shape [16, 16, 1], corresponding to the pixels from the mat.

Output: a tensor with 30 values, one per lower-body keypoint coordinate.
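To go from the saved JSON to those tensors, the data preparation was roughly the following (the field names mat and pose are just illustrative):

// samples: the array loaded from the saved JSON; each entry holds one mat
// frame (256 normalized values) and the matching 30 keypoint coordinates.
const inputs = samples.map(s => s.mat).flat();
const targets = samples.map(s => s.pose);

const xs = tf.tensor4d(inputs, [samples.length, 16, 16, 1]);
const ys = tf.tensor2d(targets); // shape [samples.length, 30]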

Loss graph from training a CNN.
My first run at training the model in TensorFlow.js using the data collected before, around 400 datapoints. I graphed the loss in Excel because I didn't manage to set up the tfjs-vis visor.
Web interface with camera feed on the right, pixelated grayscale image on the left and a button on the bottom left labelled Start.
I got set up for another round of data collection; this time I recorded around 1,300 samples.
Loss graph from training a CNN.
Loss graph from my second round of training with the new data.

This was my architecture and training parameters:


// Same architecture as the ml5 attempt, now defined directly in TensorFlow.js
model = tf.sequential();

// Two conv + max-pooling blocks extract spatial features from the 16x16 mat image
model.add(tf.layers.conv2d({
    filters: 8,
    kernelSize: 3,
    activation: 'relu',
    inputShape: [16, 16, 1]
}));
model.add(tf.layers.maxPooling2d({ poolSize: [2, 2] }));

model.add(tf.layers.conv2d({
    filters: 16,
    kernelSize: 3,
    activation: 'relu'
}));
model.add(tf.layers.maxPooling2d({ poolSize: [2, 2] }));

// Flatten and regress to the 30 keypoint coordinates
model.add(tf.layers.flatten());
model.add(tf.layers.dense({ units: 32, activation: 'relu' }));
model.add(tf.layers.dense({ units: 30 }));

// Mean squared error, since this is a regression task
model.compile({
    optimizer: tf.train.adam(0.0001),
    loss: 'meanSquaredError'
});
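The training call itself looked roughly like this; the epoch and batch-size numbers are placeholders for the values I tweaked between runs.

async function trainModel() {
  await model.fit(xs, ys, {
    epochs: 100,    // placeholder, tweaked between runs
    batchSize: 32,  // placeholder
    shuffle: true,
    callbacks: {
      // log the loss per epoch so it can be graphed externally
      onEpochEnd: (epoch, logs) => console.log(`epoch ${epoch}: loss ${logs.loss}`)
    }
  });
  await model.save('downloads://footmark-to-keypoints'); // filename is arbitrary
}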
Realtime estimations made by this model.
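The realtime estimation step is essentially this sketch: wrap the latest mat frame in a single-sample tensor, run the model, and read back the 30 coordinates.

function estimateKeypoints(matPixels) {
  return tf.tidy(() => {
    const input = tf.tensor4d(matPixels, [1, 16, 16, 1]); // one 16x16 "image"
    const output = model.predict(input);
    // flat array: [x, y, z] for each lower-body keypoint
    return Array.from(output.dataSync());
  });
}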

Overall I feel like a model of this type could work! The next steps I envision for this are:

  • Tweaking the visualization to see the Z coordinate better and be able to evaluate the accuracy.
  • Preprocessing the mat image with marching squares blob detection before passing it on to the model, to reduce noise.
  • Training with more datapoints.
  • Training with a higher resolution mat.

Week 04 ML Sketch Proposal: Footmark to Pose Estimation

For this week, I began thinking about a project suitable for training a neural network. Ever since I started making interactive experiences, I have been writing algorithms to try to detect and react to people.

When I have attempted to "translate" some sensor data to a real-life parameter (for example, the readings of an HC-SR04 to the distance of a user), I have often experienced how my algorithm fails in some unforeseen edge case (e.g. as a user approaches my prototype, somebody passes by and confuses the sensor).

The solution has usually been reworking my algorithm, adding more sensors to expand the "sensing field" of my system and/or iterating on the affordances of my prototype to better shape users' interaction intuitions. This typically takes me about 90% of the way there, to the point where breaks in the system are noticeable only to me and my teammates.

However, I can't shake the feeling that if I could rework my algorithm meticulously enough, considering most edge cases and their complex interconnections, I could perhaps bypass the need for extra sensors, or at least get closer to 99% functionality. This is what I think training a neural network could achieve, at least in some cases.

Last semester, working on Trails Overhead with Audrey Oh, I had this intuition. We had built a velostat mat and were trying to translate the pressure points into the real-life parameter of where the feet are (and whose feet are whose):

Grid with each cell in a shade of gray and a corresponding number.
The raw data coming in from the velostat mat. Here, one person is walking north to south.
Grid of cells in shades of gray. A red outline around the darker areas.
Using a marching squares algorithm to find footprints on the data. The next step we tried was estimating the angle of the foot, drawing a line across its longest side.
Grid of cells in shades of gray. A red outline around the darker areas.
Quickly it was evident that, with this sensor resolution, translating the footsteps to the "real-life" parameters was very complex.
Blurry visual with darker spots and blurred out gray spots.
In the end we decided to keep our final visual closer to the original data, allowing users to imagine and interpret the data themselves.

In this scenario, I can't help but think that writing an algorithm is possible, just very complex. I also have the theory that, because we balance ourselves on our feet, any pose we strike must produce some (maybe minute) change in the weight distribution on our feet, which could be sensed.

Line drawing. Person standing on top of a grid mat. On the right, a top-down view of the grid mat. The cells where the person is standing are colored in shades of gray. Below that, the points where the person's joints are located are shown in green.
This is the training scenario I imagine, capturing data at the same time from the mat and two views of the person, using ml5 pose estimation.

By training a model this way, I want to explore what using a sensor like this for high-fidelity pose estimation could enable.

Inputs: Velostat pressure sensor pixel buffer, maybe as a 16x16 grayscale image. Or maybe just as a one-dimensional array of grayscale pixels.

Outputs: XYZ points of the joints of the person with respect to the mat. Might focus on a small set of points for the initial testing.

Learning task: I'm not sure yet; it seems to me that it would be similar to a classification of sorts.

Challenges: I worry that the resolution of the velostat mat might be too low to give the model enough information. I also worry that this task may be too complicated to train in the browser, though I'm unsure of this. Finally, I worry about latency when capturing data: what if the velostat readings arrive faster or slower than the ml5 ones?

For this first test, I want to limit the training data to just one user at a time.
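For reference, this is one way a single training sample could be shaped under this proposal; everything here is hypothetical, just to make the input and output sizes above concrete.

// One hypothetical training sample (shapes only).
const sample = {
  // 16x16 velostat grid -> 256 pressure readings, normalized to 0-1
  mat: new Array(256).fill(0),
  // x, y, z for each tracked joint; e.g. 10 points -> 30 numbers
  pose: new Array(30).fill(0)
};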

Week 03 Photogrammetry with Depth Estimation

As soon as we saw the depth estimation model example in class, I was intrigued. My experience with depth sensing had always been with specialized depth cameras or Kinect derivatives. I wanted to see what a point cloud would look like with this model:

Screenshot of man made up of pixels in a 3D space.
Here I drew the input from the webcam in 3D space, making one plane for each pixel, and using the estimated depth as the z.
See Point Cloud sketch
Screenshot of man made up of pixels in a 3D space.
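The drawing loop behind these point clouds was roughly the following. This is a sketch, not the exact code: it assumes a WEBGL canvas, that video is the webcam capture, and that depthMap holds one normalized depth value (0 to 1) per pixel, as returned by the depth model.

const step = 8;         // sample every 8th pixel to keep the frame rate usable
const depthScale = 300; // how far the planes spread out in z (arbitrary)

function draw() {
  background(0);
  orbitControl();
  video.loadPixels();
  for (let y = 0; y < video.height; y += step) {
    for (let x = 0; x < video.width; x += step) {
      const i = (y * video.width + x) * 4;
      const d = depthMap[y * video.width + x]; // estimated depth for this pixel
      push();
      // center the cloud and push each "pixel plane" back by its depth
      translate(x - video.width / 2, y - video.height / 2, d * depthScale);
      fill(video.pixels[i], video.pixels[i + 1], video.pixels[i + 2]);
      noStroke();
      plane(step, step);
      pop();
    }
  }
}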

This was very exciting! But since these were just points, you could see through them when you rotated the view. I wanted to try to build a 3D model from this, so I started to look into p5 geometries. Because this was a more intensive process, I couldn't make it run continuously; instead, the user takes a picture to trigger it.

I made a loop that would add each 3D "pixel" of the image to a p5 geometry and set its faces manually (sketched in code after the images below):

Screenshot of 3D model with the silhouette of a bust in the corner.
At first I couldn't understand why my objects had so much dead space around them. It turned out that only the first rendering had that problem; after rendering more "pictures" it worked as expected.
Screenshot of 3D model zoomed in.
I had a hard time coding the faces of the model, since the computeFaces() function didn't do a very good job due to the complexity of the shape.
Screenshot of 3D model of a bust.
Once I had the faces figured out, I had the model! This is a view of the 3D model as an STL file, downloaded on my computer.
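The loop itself was roughly as follows. Again a sketch rather than the exact code: it assumes cols and rows for the sampled grid size, a depthScale factor, and a hypothetical getDepth(x, y) helper that returns the normalized estimated depth for a grid point.

const geom = new p5.Geometry();
geom.gid = 'depth-photo'; // an id may be needed so p5 can buffer a hand-built geometry

// one vertex (and UV) per sampled pixel, with the estimated depth as z
for (let y = 0; y < rows; y++) {
  for (let x = 0; x < cols; x++) {
    const z = getDepth(x, y) * depthScale;
    geom.vertices.push(createVector(x - cols / 2, y - rows / 2, z));
    geom.uvs.push(x / (cols - 1), y / (rows - 1)); // used later for texturing
  }
}

// connect neighboring grid points: two triangles per cell
for (let y = 0; y < rows - 1; y++) {
  for (let x = 0; x < cols - 1; x++) {
    const i = y * cols + x;
    geom.faces.push([i, i + 1, i + cols]);
    geom.faces.push([i + 1, i + cols + 1, i + cols]);
  }
}
geom.computeNormals();
// then, in draw(): texture(snapshot); model(geom);

Storing the UVs alongside the vertices is what made the texturing in the next step possible.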

With the shape set, I couldn't help but want to texture it with the photo captured through the webcam! Since I had learned how this worked from last week's experiments, I set up the UVs to make it possible:

Screenshot of 3D model of a man and a photo of them below it.
I love the low poly look it gives. I wonder how much it would change if I generated the model with more pixels.
See Photogrammetry Photo Booth Sketch
Screenshot of 3D model of a man with their hand stretched out.
It looks best when varying depths are part of the picture. I could imagine this sort of image being used in a poster.
Screenshot of 3D model of a man with their hand stretched out.

I'm really drawn to the creative possibilities of this method. I wonder if there's any way of optimizing what I'm doing further, so that we can get it running somewhat close to realtime. But maybe p5 is not the best platform for this.

Week 02 Reading on Transfer Learning and Realtime FaceMesh UV Unwrapping

This week I went through the additional material to understand the theory better. The TensorFlow transfer learning article made the concept very clear. It made me wonder how far we can push the new data (in relation to the original training data) and still get an effective transfer learning result. 3blue1brown's videos 1 and 2 also helped me cement some more knowledge, but I still have to get around to the third part.

Afterwards I started experimenting with FaceMesh, intrigued by the possibility of mapping the face as a real-time texture. I began by trying to understand the triangle distribution:

Screenshot of triangle mesh on the top part of the image. Screenshot of javascript console on the bottom.
After comparing the triangles on the FaceMesh UV image and the triangles given by the getTriangles() method, I realized they are not the same, as seen in this image. I'm not sure why that is, but it turned out not to cause any problems.
Screenshot of face composed of misaligned triangles.
Trying to understand the triangles. This is the result of drawing all vertices in sequence, instead of drawing them in the order of the triangles. A fun-looking mistake that gives me new ideas.

Once this was clear, I wanted to add the eyes onto the mesh:

Zoomed in screenshot of 3D face with wire mesh overlayed, without eye.
Before stitching in the new triangles for the eyes.
Zoomed in screenshot of 3D face with wire mesh overlayed.
Result after stitching in the eyes.
See FaceMesh with Eyes sketch

After this, I wanted to figure out how to flatten the face: how to get a realtime unwrapped UV map of it.

Screenshot of face mesh in green over black background.
I used the getUVCoords() method to get the static 2D locations of each vertex of the face model. These I then mapped to the dimensions of the canvas.
Zoomed in grayscale screenshot of UV unwrapped face texture.
I then created the geometry using the static locations, but gave it a dynamic UV map based on the realtime location of each vertex on the webcam video, flattening the face no matter where it was. It looks very janky on the edges, but it makes sense given it has less pixel information there.
See Realtime UV Unwrap Sketch
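The core of the unwrap is roughly this sketch. It assumes a WEBGL canvas, an ml5 faceMesh instance, a current face prediction with keypoints in webcam pixel coordinates, and that getUVCoords() returns normalized [u, v] pairs.

const uvCoords = faceMesh.getUVCoords();   // static, normalized [u, v] per face vertex
const triangles = faceMesh.getTriangles(); // vertex indices, three per triangle

translate(-width / 2, -height / 2); // WEBGL origin is the canvas center
texture(video);
noStroke();
beginShape(TRIANGLES);
for (const tri of triangles) {
  for (const i of tri) {
    // place the vertex at its *static* UV position on the canvas...
    const x = uvCoords[i][0] * width;
    const y = uvCoords[i][1] * height;
    // ...but sample the webcam texture at the vertex's *live* position
    const kp = face.keypoints[i];
    vertex(x, y, 0, kp.x, kp.y);
  }
}
endShape();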

It would be great if the base static model could be accessed not only in 2D but in full 3D directly through a FaceMesh method. Right now it seems this is not included; here is some discussion of a workaround that I found very informative.

That being said, in my understanding, this model homogenizes all facial structures to follow the base 3D face model. It is, for example, incapable of capturing the actual shape of my nose, even when I look sideways.

It sort of makes sense considering the model is meant to add fun 3D effects to and around the face, not to do accurate photogrammetry. This, I think, is something creators need to understand before making experiences with the model, as the depiction of the user's 3D face on screen is a distortion of their real appearance.

Week 01 "Excavating AI", "Humans of AI" and Image Classification

"Excavating AI" uncovered a lot of the black box behind the categorization of data for AI for me. It is very confounding to see how little thought was given to some of the categories of the ImageNet dataset before proceeding with the labelling. I can't imagine what good or non-harmful use could be derived of categorizing people through such detailed and loaded classes.

The lack of context in the images is also a big issue. Vision is a time-based sense; we are continually seeing and reinterpreting what we see based on different levels of context, including the events happening immediately before. These datasets, instead, look at only one moment, extracted from time and space, ignorant of any contextual clues. With this in mind, it is hard to imagine how AI systems trained on this dataset (and the clickworkers labelling it) could avoid getting the wrong idea most of the time when it comes to classifying people.

Some more personal levels of this context are explored in "Humans of A.I.", which does meticulous archaeological work to find the original creators of the images in the COCO dataset, not only giving recognition to their images but also looking into what these images and their subjects meant to them. Through this work it is easy to see oneself reflected in the work behind AI models; in a way, they seem to be things we build together. However, the way the data is treated, through obscuring and decontextualizing, breaks this apart.

I wonder about the plausibility of making fully transparent AI models: calls for participating in a dataset, willingly and with credit. How could these credits be part of the model itself and not just of the readme? Could one of the object literal key-value pairs returned by the model when making a prediction contain the name of a dataset contributor? Is there a good way to choose a contributor dynamically, in a way that is also related to the prediction?

Screenshot of Teachable Machine model training UI.
Image of me training a Teachable Machine model to detect whether I'm facing left, right or center.

For this week's exercise, working with image classification, I had a hard time choosing what to create. Every time I have trained an image classification model in Teachable Machine, I face the same issue: the trained model only works well in the same controlled conditions. As soon as the lighting or location changes, things start to break down.

To combat that, I tried training a classification model on poses instead of images per se, as I think that makes it more resistant to changing conditions.

It was really hard to get it to detect face orientation reliably. Clearly this is something better suited to an if statement, but just for the sake of testing I pushed on.

Screenshot of p5js sketch detecting a face looking forward.
Detecting me looking straight forward.
Screenshot of p5js sketch detecting a face showing its right side.
Detecting the right side of my face.
Screenshot of p5js sketch detecting a face showing its left side.
Detecting the left side of my face.

I soon figured out that the given p5.js example didn't work for this type of classification, so I had to look up how to use this kind of Teachable Machine model with p5. I found an example by Dan O'Sullivan that worked great, so I started from there.
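The core of that approach looks roughly like this; a sketch based on the standard Teachable Machine tmPose export snippet rather than Dan O'Sullivan's exact code, with a placeholder model URL.

const URL = "https://teachablemachine.withgoogle.com/models/XXXXXXX/"; // placeholder
let tmModel;

async function loadModel() {
  tmModel = await tmPose.load(URL + "model.json", URL + "metadata.json");
}

async function classifyFrame(videoElt) {
  // first estimate the pose keypoints, then classify that pose vector
  const { pose, posenetOutput } = await tmModel.estimatePose(videoElt);
  const predictions = await tmModel.predict(posenetOutput);
  predictions.sort((a, b) => b.probability - a.probability);
  return predictions[0].className; // e.g. "left", "right" or "center"
}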

This is the test sketch I made (pictured above). Because it uses poses, I hope it will work with other people and not only me, but I have yet to test that.