<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://jachansantiago.com//feed.xml" rel="self" type="application/atom+xml" /><link href="https://jachansantiago.com//" rel="alternate" type="text/html" /><updated>2026-03-07T00:20:03-05:00</updated><id>https://jachansantiago.com//feed.xml</id><title type="html">blank</title><subtitle>Jeffrey Chan&apos;s webpage and blog.
</subtitle><entry><title type="html">An opportunity for Puerto Rico Economy</title><link href="https://jachansantiago.com//blog/2022/puerto-rico-software-industry/" rel="alternate" type="text/html" title="An opportunity for Puerto Rico Economy" /><published>2022-01-06T00:00:00-05:00</published><updated>2022-01-06T00:00:00-05:00</updated><id>https://jachansantiago.com//blog/2022/puerto-rico-software-industry</id><content type="html" xml:base="https://jachansantiago.com//blog/2022/puerto-rico-software-industry/"><![CDATA[<div class="container mt-5">
    


<img class="img-fluid z-depth-1 rounded  pt-3 pb-3 pl-3 pr-3" style="background: white; max-width: 700px; width: 100%; display: block; margin-left: auto; margin-right: auto;" src="/assets/resized/uprrp-800x533.jpeg" srcset="    /assets/resized/uprrp-480x320.jpeg 480w,    /assets/resized/uprrp-800x533.jpeg 800w,/assets/img/puerto-rico-software/uprrp.jpeg 900w" />

    <div class="caption">
        Universidad de Puerto Rico, Rio Piedras.
    </div>
</div>

<p>On a small island like Puerto Rico, the lack of raw materials and expensive transportation limit industrial growth. For this reason, it is hard to build companies like Walmart, Tesla, or Apple from Puerto Rico. Meanwhile, software companies have been growing since the 2000s to the point of reaching the top of the stock market. It is remarkable that, as of 2021, the <a href="https://caribbeanbusiness.com/top-200-locally-owned-companies-2021/">Caribbean Business top 200 locally owned companies</a> list does not include a single software company.</p>

<p>Software companies have two advantages: 1) they do not require raw materials, and 2) their distribution is almost free thanks to the internet. A software company's "raw materials" are disruptive ideas and computational education. Unlike oil, gas, and gold, ideas can grow in any part of the Earth. Similarly, computer science can be learned from anywhere in the world, thanks to the internet. Most software services companies do not require a significant initial investment, but it is crucial to be disruptive enough to compete globally.</p>

<p>The competition is brutal, especially in regions like Silicon Valley, Shenzhen, and Bengaluru, because of their highly developed computational education. Still, Puerto Rican companies have the advantage of tackling unique local niche problems. They are the only ones that can create adequate solutions for this local niche and then expand to other regions with similar issues. Another strategy is researching and developing technologies to gain an advantage over other companies worldwide. Ideas built on emerging technologies such as quantum computing, bioinformatics, and artificial intelligence could yield market advantages through patents and the commercialization of these advances.</p>

<p>It is no secret that universities with excellent computer science and engineering programs have been very influential since the beginning of Silicon Valley. Higher education and research advances helped create the technology behind companies like Apple, Google, and Netflix. For example, the PageRank search algorithm was developed as part of research at Stanford University by the founders of Google. Similarly, Puerto Rico has the potential to follow in Silicon Valley's footsteps if it invests in education, building enough intellectual capital to sustain a thriving software industry.</p>

<p>In conclusion, the software industry is ideal for Puerto Rico because it does not require raw materials or high transportation costs. Although the competition is global, Puerto Rico has two advantages: 1) serving the local niche and 2) developing new technologies. The latter strategy requires investing in education, including programs focused on computer science at all levels, to increase the number of local companies in the software area. And who knows, maybe we will have our first unicorn (a company valued at one billion dollars or more) before long.</p>]]></content><author><name></name></author><category term="puerto-rico" /><category term="software" /><category term="industry" /><category term="education" /><summary type="html"><![CDATA[Although the software industry is ideal for Puerto Rico because it does not require raw materials or high transportation costs, as of 2021 there was no software company among the top 200 local companies, according to Caribbean Business magazine. In this short article, we mention some strategies to grow this industry.]]></summary></entry><entry><title type="html">An Intuitive Introduction to Machine Learning</title><link href="https://jachansantiago.com//blog/2021/machine-learning/" rel="alternate" type="text/html" title="An Intuitive Introduction to Machine Learning" /><published>2021-11-07T00:00:00-04:00</published><updated>2021-11-07T00:00:00-04:00</updated><id>https://jachansantiago.com//blog/2021/machine-learning</id><content type="html" xml:base="https://jachansantiago.com//blog/2021/machine-learning/"><![CDATA[<h1 id="introduction">Introduction</h1>
<p>I believe everyone has heard about machine learning and how it has been accelerating science and industry. Protein folding, antibiotic discovery, and robust animal behavior monitoring are examples of how machine learning has accelerated scientific advances. Many industries, such as finance, health, and even software engineering, have been applying machine learning to facilitate, automate, or guide essential processes. Machine learning has changed the paradigm from explicitly programming a solution to training a model for tasks that are hard to program. Andrej Karpathy explains more about this in his blog post titled <a href="https://link.medium.com/YcPpazSFZkb">Software 2.0</a>.</p>

<p>In many applications, machine learning seems to work like magic, but it isn't magic. The purpose of this blog post is to uncover the magic behind machine learning and answer: how does machine learning learn from data to make decisions?</p>

<p>Keep in mind that the goal of machine learning is to learn a decision function from training data while generalizing to new examples. There are three crucial aspects to this description: 1) the decision function; 2) how to efficiently represent the input data; and 3) how to measure the model's generalization. In the next sections, I will introduce the intuition behind these aspects, but first, a motivating example.</p>

<h1 id="decision-function">Decision Function</h1>

<p>Imagine a simple case where you are designing an application to tell whether today is a good day to go to the beach. Usually, when I go to the beach, I check two measurements: 1) precipitation probability and 2) rip currents. If we collect these measurements from previous days and plot them, we get the graph in Figure 1. This graph has labeled examples of good days (blue dots) and bad days (red dots). Note that the x-axis shows the precipitation probability, and the y-axis shows rip currents.</p>

<div class="container mt-5">
    


<img class="img-fluid z-depth-1 rounded  pt-5 pb-4 pl-3" style="background: white; max-width: 700px; width: 100%; display: block; margin-left: auto; margin-right: auto;" src="/assets/resized/data-1400x657.png" srcset="    /assets/resized/data-480x225.png 480w,    /assets/resized/data-800x376.png 800w,    /assets/resized/data-1400x657.png 1400w,/assets/img/machine-learning-intuition/data.png 3767w" />

    <div class="caption">
        Figure 1: Precipitation probability versus rip currents by good/bad day class.
    </div>
</div>

<p>If I ask you, based on today's measurements (the grey dot), whether today is a good day to go to the beach, what would your answer be? If you answered yes, you immediately noticed a pattern in the graph: good days are clustered at the bottom left, and because today's dot falls in that region, today should be a good day to go to the beach. But what does it mean to be in the blue region? Well, good days have a low precipitation probability and low rip currents.</p>

<p>But how do we formalize this pattern? First, we need to define what a decision function is. A decision function receives features or measurements as inputs and decides which class to assign, based on the training data points. For a given training dataset, many decision functions may exist, depending on the complexity of the model and the data. Figure 2 shows three different decision functions that are valid for our example training dataset. Figure 2a shows a simple line (logistic regression) that separates good-day examples from bad-day examples.</p>
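<p>To make this concrete, here is a minimal sketch of fitting such a linear decision function. The toy measurements and the use of scikit-learn's LogisticRegression are illustrative assumptions, not values from the figures:</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: [precipitation probability, rip current strength]
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],   # good days
              [0.8, 0.7], [0.9, 0.6], [0.7, 0.9]])    # bad days
y = np.array([1, 1, 1, 0, 0, 0])                      # 1 = good day, 0 = bad day

# Logistic regression learns a line separating the two classes,
# like the simple decision function in Figure 2a
model = LogisticRegression().fit(X, y)

# Today's measurements (the grey dot): low precipitation, low rip current
today = [[0.2, 0.2]]
print(model.predict(today))  # falls in the "good day" region
```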

<div class="container mt-5">
    <div class="row">
    <div class="col-lg-4">
    


<img class="img-fluid z-depth-1 rounded  pt-5 pb-4 pl-3" style="background: white; max-width: 700px; width: 100%; display: block; margin-left: auto; margin-right: auto;" src="/assets/resized/decision1-1400x662.png" srcset="    /assets/resized/decision1-480x227.png 480w,    /assets/resized/decision1-800x378.png 800w,    /assets/resized/decision1-1400x662.png 1400w,/assets/img/machine-learning-intuition/decision1.png 3767w" />

    <div class="caption">
        a): Simple Decision Function.
    </div>
    </div>
    <div class="col-lg-4">
    


<img class="img-fluid z-depth-1 rounded  pt-5 pb-4 pl-3" style="background: white; max-width: 700px; width: 100%; display: block; margin-left: auto; margin-right: auto;" src="/assets/resized/decision2-1400x657.png" srcset="    /assets/resized/decision2-480x225.png 480w,    /assets/resized/decision2-800x376.png 800w,    /assets/resized/decision2-1400x657.png 1400w,/assets/img/machine-learning-intuition/decision2.png 3767w" />

    <div class="caption">
        b): Complex Decision Function.
    </div>
    </div>
    <div class="col-lg-4">
    


<img class="img-fluid z-depth-1 rounded  pt-5 pb-4 pl-3" style="background: white; max-width: 700px; width: 100%; display: block; margin-left: auto; margin-right: auto;" src="/assets/resized/decision3-1400x657.png" srcset="    /assets/resized/decision3-480x225.png 480w,    /assets/resized/decision3-800x376.png 800w,    /assets/resized/decision3-1400x657.png 1400w,/assets/img/machine-learning-intuition/decision3.png 3767w" />

    <div class="caption">
        c): Very Complex Decision Function.
    </div>
    </div>
    </div>
    <div class="caption">
        Figure 2: Precipitation probability versus rip currents by good/bad day class.
    </div>
</div>

<h1 id="feature-representation">Feature Representation</h1>
<p>In machine learning, there are two things we can control: the model and the data. In this section, I will talk about the more important of the two: the data. The data can have multiple representations and features, some relevant and others irrelevant to the target task. It is critical that the model receives enough information to make a good decision. You cannot expect a machine learning model to figure out the solution from irrelevant features or incomplete information.</p>

<p>Figure 3 shows an example of irrelevant or incomplete information: the y-axis was changed from rip currents to wind speed, which is irrelevant to this task. Notice that we introduced an irrelevant feature and removed part of the information relevant to solving the task. In the best case, the model ignores the wind speed feature and relies only on precipitation probability; but with precipitation probability alone, the model does not have complete information to make a good choice. The lesson here is that machine learning learns reasonable patterns from the data but does not work magic on data that does not make sense.</p>

<div class="container mt-5">
    


<img class="img-fluid z-depth-1 rounded  pt-5 pb-4 pl-3" style="background: white; max-width: 700px; width: 100%; display: block; margin-left: auto; margin-right: auto;" src="/assets/resized/baddata-1400x656.png" srcset="    /assets/resized/baddata-480x225.png 480w,    /assets/resized/baddata-800x375.png 800w,    /assets/resized/baddata-1400x656.png 1400w,/assets/img/machine-learning-intuition/baddata.png 3767w" />

    <div class="caption">
        Figure 3: Example of bad feature selection.
    </div>
</div>
<p>Generally, data scientists spend a considerable amount of time deciding which features are relevant to solving the task. It is critical to remove irrelevant features because they can introduce noise into the model. Some features may not be useful by themselves, but combining them with others and transforming them into new features can produce relevant ones. This process is called feature engineering: collecting and transforming features to simplify the feature representation of the problem.</p>
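<p>As a small illustration of feature engineering (the column names and the combined feature below are hypothetical, not from the post):</p>

```python
import pandas as pd

# Hypothetical weather measurements
df = pd.DataFrame({
    "wind_speed": [10.0, 25.0, 5.0],
    "wave_height": [0.5, 2.0, 0.3],
})

# Neither column may be useful by itself, but their product, a rough
# proxy for surf roughness, could be a more relevant combined feature
df["roughness"] = df["wind_speed"] * df["wave_height"]
print(df["roughness"].tolist())  # [5.0, 50.0, 1.5]
```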

<h1 id="generalization">Generalization</h1>

<p>Now that we have a good representation and a model that fits the training data well, are we ready to deploy our application to make predictions on user data? Not yet. First, we need to verify that our model can generalize to data points it has never seen before. But how can we measure generalization to all possible future data points? Should we collect every possible data point in our training set? The answer is no. Generally, we divide the dataset into two folds: the training set and the testing set. The idea is to evaluate the model on the testing set, which contains novel examples that do not appear in the training set, to approximate the generalization error.</p>
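<p>In code, this split is usually a single call. The sketch below uses synthetic data and scikit-learn's train_test_split, both of which are assumptions for illustration:</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic dataset: 100 points, 2 features, linearly separable labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 20% of the data as a testing set the model never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)

# The gap between these two scores approximates the generalization error
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
```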

<p>Now that we have a way to approximate the generalization of a model, you may encounter one of the following scenarios:</p>

<ol>
  <li>Poor training and testing performance (Underfitting)</li>
  <li>Good training and poor testing performance (Overfitting)</li>
  <li>Good training and testing performance</li>
</ol>

<div class="container mt-5">
    <div class="row">
    <div class="col-lg-4">
    


<img class="img-fluid z-depth-1 rounded  pt-5 pb-4 pl-3" style="background: white; max-width: 700px; width: 100%; display: block; margin-left: auto; margin-right: auto;" src="/assets/resized/underfitting-1400x664.png" srcset="    /assets/resized/underfitting-480x228.png 480w,    /assets/resized/underfitting-800x379.png 800w,    /assets/resized/underfitting-1400x664.png 1400w,/assets/img/machine-learning-intuition/underfitting.png 3740w" />

    <div class="caption">
        a): Underfitting.
    </div>
    </div>
    <div class="col-lg-4">
    


<img class="img-fluid z-depth-1 rounded  pt-5 pb-4 pl-3" style="background: white; max-width: 700px; width: 100%; display: block; margin-left: auto; margin-right: auto;" src="/assets/resized/fitting-1400x660.png" srcset="    /assets/resized/fitting-480x226.png 480w,    /assets/resized/fitting-800x377.png 800w,    /assets/resized/fitting-1400x660.png 1400w,/assets/img/machine-learning-intuition/fitting.png 3740w" />

    <div class="caption">
        b): Good Fit.
    </div>
    </div>
    <div class="col-lg-4">
    


<img class="img-fluid z-depth-1 rounded  pt-5 pb-4 pl-3" style="background: white; max-width: 700px; width: 100%; display: block; margin-left: auto; margin-right: auto;" src="/assets/resized/overfitting-1400x660.png" srcset="    /assets/resized/overfitting-480x226.png 480w,    /assets/resized/overfitting-800x377.png 800w,    /assets/resized/overfitting-1400x660.png 1400w,/assets/img/machine-learning-intuition/overfitting.png 3740w" />

    <div class="caption">
        c): Overfitting.
    </div>
    </div>
    </div>
    <div class="caption">
        Figure 4: Examples of fitting.
    </div>
</div>

<p>Poor training performance indicates underfitting, meaning that your feature representation is not adequate or that the model is not complex enough for the training dataset. If various models fail to achieve good performance, you should probably simplify the feature representation through feature engineering.</p>

<p>Suppose you have a model that learns how to perform some task, but when you evaluate it on the testing dataset, you find that it performs poorly. This means the model memorized the training data but did not generalize well; this is a sign of overfitting, which may be due to the high complexity of your model. You can try techniques to avoid overfitting; one of them is to reduce the complexity of your model.</p>

<p>If you get good training and testing performance, you are good to go. Note that the testing performance is usually lower than the training performance. You can still try to optimize some model parameters to improve performance.</p>
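<p>One common way to optimize model parameters is a cross-validated grid search. The sketch below tunes the regularization strength of a logistic regression on synthetic data; the model and the parameter grid are illustrative assumptions:</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, as before
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

# Try several regularization strengths; a smaller C means a simpler
# (more regularized) model, which can also help against overfitting
search = GridSearchCV(LogisticRegression(),
                      {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```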

<h1 id="conclusion">Conclusion</h1>

<p>In this introductory post about machine learning, we discussed how a decision function uses training data to make decisions. We emphasized how the input data representation can affect your model's performance. We also introduced the testing set as a way to approximate the generalization error. I hope that after this post, you have a good intuition for how machine learning works.</p>]]></content><author><name></name></author><category term="machine-learning" /><category term="decision-function" /><category term="generalization" /><summary type="html"><![CDATA[In many applications, machine learning seems to work like magic, but it isn't magic. The purpose of this blog post is to uncover the magic behind machine learning and answer: how does machine learning learn from data to make decisions? This post introduces the intuition behind three crucial aspects of machine learning: 1) the decision function; 2) how to efficiently represent the input data; and 3) how to measure the model's generalization.]]></summary></entry><entry><title type="html">Pollen Classification</title><link href="https://jachansantiago.com//blog/2021/pollen-classification/" rel="alternate" type="text/html" title="Pollen Classification" /><published>2021-09-21T00:00:00-04:00</published><updated>2021-09-21T00:00:00-04:00</updated><id>https://jachansantiago.com//blog/2021/pollen-classification</id><content type="html" xml:base="https://jachansantiago.com//blog/2021/pollen-classification/"><![CDATA[<p>This post shows how to train a convolutional neural network for pollen classification. We use part of the MobileNetV2 network for feature extraction, followed by one ReLU layer and one sigmoid layer for classification.</p>

<!-- Place this tag in your head or just before your close body tag. -->
<script async="" defer="" src="https://buttons.github.io/buttons.js"></script>

<!-- Place this tag where you want the button to render. -->
<p><a class="github-button" href="https://github.com/jachansantiago/pollenlab" data-color-scheme="no-preference: light; light: light; dark: light;" data-size="large" aria-label="View on Github">View source on Github</a></p>

<!-- [Plotbee](https://github.com/jachansantiago/plotbee){:target="_blank"} -->
<p><a href="https://colab.research.google.com/github/jachansantiago/pollenlab/blob/master/train_pollen_colab.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p>

<h4 id="dependecies">Dependencies</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="n">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="n">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="n">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>

<span class="kn">from</span> <span class="n">tensorflow.keras.models</span> <span class="kn">import</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="n">tensorflow.keras.layers</span> <span class="kn">import</span> <span class="n">Flatten</span><span class="p">,</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Input</span>
<span class="kn">from</span> <span class="n">tensorflow.keras.applications</span> <span class="kn">import</span> <span class="n">MobileNetV2</span>
<span class="kn">from</span> <span class="n">tensorflow_addons.metrics</span> <span class="kn">import</span> <span class="n">F1Score</span>
<span class="kn">from</span> <span class="n">sklearn.metrics</span> <span class="kn">import</span> <span class="n">classification_report</span><span class="p">,</span> <span class="n">confusion_matrix</span><span class="p">,</span> <span class="n">ConfusionMatrixDisplay</span>
</code></pre></div></div>

<h2 id="dataset-functions">Dataset Functions</h2>

<p>Here we use the <a href="https://www.tensorflow.org/api_docs/python/tf/keras/utils/image_dataset_from_directory">tf.keras.preprocessing.image_dataset_from_directory</a> function to load the dataset from the <code class="language-plaintext highlighter-rouge">images/</code> directory. The labels of the images are inferred from the names of the folders that contain them.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>images/
...NP/
......a_image_1.jpg
......a_image_2.jpg
...P/
......b_image_1.jpg
......b_image_2.jpg
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">normalize_image</span><span class="p">(</span><span class="n">image</span><span class="p">,</span><span class="n">label</span><span class="p">):</span>
    <span class="n">image</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="nf">cast</span><span class="p">(</span><span class="n">image</span><span class="o">/</span><span class="mf">255.</span> <span class="p">,</span><span class="n">tf</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">image</span><span class="p">,</span><span class="n">label</span>

<span class="n">train_dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">preprocessing</span><span class="p">.</span><span class="nf">image_dataset_from_directory</span><span class="p">(</span>
    <span class="sh">"</span><span class="s">images/</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">labels</span><span class="o">=</span><span class="sh">"</span><span class="s">inferred</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">label_mode</span><span class="o">=</span><span class="sh">"</span><span class="s">binary</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">color_mode</span><span class="o">=</span><span class="sh">"</span><span class="s">rgb</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
    <span class="n">image_size</span><span class="o">=</span><span class="p">(</span><span class="mi">90</span><span class="p">,</span> <span class="mi">90</span><span class="p">),</span>
    <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">seed</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
    <span class="n">validation_split</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span>
    <span class="n">subset</span><span class="o">=</span><span class="sh">"</span><span class="s">training</span><span class="sh">"</span>
<span class="p">).</span><span class="nf">map</span><span class="p">(</span><span class="n">normalize_image</span><span class="p">)</span>

<span class="n">valid_dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">preprocessing</span><span class="p">.</span><span class="nf">image_dataset_from_directory</span><span class="p">(</span>
    <span class="sh">"</span><span class="s">images/</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">labels</span><span class="o">=</span><span class="sh">"</span><span class="s">inferred</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">label_mode</span><span class="o">=</span><span class="sh">"</span><span class="s">binary</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">color_mode</span><span class="o">=</span><span class="sh">"</span><span class="s">rgb</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
    <span class="n">image_size</span><span class="o">=</span><span class="p">(</span><span class="mi">90</span><span class="p">,</span> <span class="mi">90</span><span class="p">),</span>
    <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">seed</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
    <span class="n">validation_split</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span>
    <span class="n">subset</span><span class="o">=</span><span class="sh">"</span><span class="s">validation</span><span class="sh">"</span><span class="p">,</span>
<span class="p">).</span><span class="nf">map</span><span class="p">(</span><span class="n">normalize_image</span><span class="p">)</span>

</code></pre></div></div>

<p>Here we plot some examples to see what the images in this dataset look like. We can identify variations in bee pose, size, illumination, rotation, etc.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="nf">subplots</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">15</span><span class="p">))</span>
<span class="n">axes</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="nf">ravel</span><span class="p">()</span>

<span class="n">gen</span> <span class="o">=</span> <span class="nf">iter</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">)</span>
<span class="n">sample_batch</span> <span class="o">=</span> <span class="nf">next</span><span class="p">(</span><span class="n">gen</span><span class="p">)</span>

<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">label</span><span class="p">)</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="nf">zip</span><span class="p">(</span><span class="n">sample_batch</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">sample_batch</span><span class="p">[</span><span class="mi">1</span><span class="p">])):</span>
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">imshow</span><span class="p">(</span><span class="n">image</span><span class="p">)</span>
    <span class="n">label_str</span> <span class="o">=</span> <span class="sh">"</span><span class="s">Pollen</span><span class="sh">"</span> <span class="k">if</span> <span class="n">label</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">else</span> <span class="sh">"</span><span class="s">No Pollen</span><span class="sh">"</span>
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">set_title</span><span class="p">(</span><span class="sh">"</span><span class="s">{}</span><span class="sh">"</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="n">label_str</span><span class="p">))</span>
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">set_xticks</span><span class="p">([])</span>
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">set_yticks</span><span class="p">([])</span>
</code></pre></div></div>

<div class="row">
    <div class="col-sm mt-3 mt-md-0">
        <img class="img-fluid rounded z-depth-1" src="/assets/img/pollen_classification/output_6_0.png" alt="" title="Dataset examples." />
    </div>
</div>
<div class="caption">
    Dataset examples.
</div>

<h2 id="mobilenetv2-as-feature-extractor">MobileNetV2 as Feature extractor</h2>

<p>In this notebook we use MobileNetV2, which comes with Keras. You can find other pre-made models in <a href="https://www.tensorflow.org/api_docs/python/tf/keras/applications">tf.keras.applications</a>; more details about the models are available <a href="https://keras.io/api/applications/">here</a>. We cut the network at the <code class="language-plaintext highlighter-rouge">block_6</code> layer so that the features have a resolution of <code class="language-plaintext highlighter-rouge">12x12</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">backbone</span> <span class="o">=</span> <span class="nc">MobileNetV2</span><span class="p">(</span><span class="n">include_top</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">90</span><span class="p">,</span> <span class="mi">90</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
<span class="n">model_input</span> <span class="o">=</span> <span class="n">backbone</span><span class="p">.</span><span class="nb">input</span>
<span class="n">model_out</span> <span class="o">=</span> <span class="n">backbone</span><span class="p">.</span><span class="nf">get_layer</span><span class="p">(</span><span class="sh">"</span><span class="s">block_6_expand_relu</span><span class="sh">"</span><span class="p">).</span><span class="n">output</span>
<span class="n">feature_extractor</span> <span class="o">=</span> <span class="nc">Model</span><span class="p">(</span><span class="n">model_input</span><span class="p">,</span> <span class="n">model_out</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="classification-layer">Classification Layer</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Classifier</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">Model</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">base_model</span><span class="p">,</span> <span class="n">filters</span><span class="o">=</span><span class="mi">64</span><span class="p">,</span> <span class="n">classes</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span>
        <span class="nf">super</span><span class="p">(</span><span class="n">Classifier</span><span class="p">,</span> <span class="n">self</span><span class="p">).</span><span class="nf">__init__</span><span class="p">()</span>
        <span class="n">self</span><span class="p">.</span><span class="n">backbone</span> <span class="o">=</span> <span class="n">base_model</span>
        <span class="n">self</span><span class="p">.</span><span class="n">flatten</span> <span class="o">=</span> <span class="nc">Flatten</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="sh">'</span><span class="s">flatten</span><span class="sh">'</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">dense</span> <span class="o">=</span> <span class="nc">Dense</span><span class="p">(</span><span class="n">filters</span><span class="p">,</span><span class="n">activation</span><span class="o">=</span><span class="sh">'</span><span class="s">relu</span><span class="sh">'</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="sh">"</span><span class="s">ReLU_layer</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">classes</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">self</span><span class="p">.</span><span class="n">classifier</span> <span class="o">=</span> <span class="nc">Dense</span><span class="p">(</span><span class="n">classes</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="sh">"</span><span class="s">sigmoid</span><span class="sh">"</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="sh">"</span><span class="s">sigmoid_layer</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">self</span><span class="p">.</span><span class="n">classifier</span> <span class="o">=</span> <span class="nc">Dense</span><span class="p">(</span><span class="n">classes</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="sh">"</span><span class="s">softmax</span><span class="sh">"</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">model_name</span> <span class="o">=</span> <span class="sh">"</span><span class="s">Classifier</span><span class="sh">"</span>
        
    <span class="k">def</span> <span class="nf">call</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">data</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">backbone</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">flatten</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">dense</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">id_class</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">classifier</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">id_class</span>


<span class="n">model</span> <span class="o">=</span> <span class="nc">Classifier</span><span class="p">(</span><span class="n">feature_extractor</span><span class="p">,</span> <span class="n">classes</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>

<div class="row">
    <div class="col-sm mt-3 mt-md-0 text-center">
        <img class="img-fluid rounded z-depth-1" src="/assets/img/pollen_classification/model.png" alt="" title="Model Diagram." />
    </div>
</div>
<div class="caption">
    Model Diagram.
</div>

<h2 id="model-training">Model Training</h2>

<p>The model is trained with the binary cross-entropy loss.</p>

\[loss = - \frac{1}{N} \sum_i^N \left[ y_i \log{\hat{y}_i} + (1 - y_i) \log (1 - \hat{y}_i) \right]\]
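<p>As a sanity check, the loss above can be computed directly with NumPy. This is a minimal sketch, not Keras internals; the <code class="language-plaintext highlighter-rouge">eps</code> clipping to avoid <code class="language-plaintext highlighter-rouge">log(0)</code> is an assumption of this sketch:</p>

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, term by term as in the formula above."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])
print(binary_cross_entropy(y_true, y_pred))  # small loss: predictions match labels
```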

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">.</span><span class="nf">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="sh">'</span><span class="s">binary_crossentropy</span><span class="sh">'</span><span class="p">,</span> <span class="n">optimizer</span><span class="o">=</span><span class="sh">"</span><span class="s">adam</span><span class="sh">"</span><span class="p">,</span><span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="sh">'</span><span class="s">accuracy</span><span class="sh">'</span><span class="p">,</span> <span class="nc">F1Score</span><span class="p">(</span><span class="n">num_classes</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">threshold</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)])</span>
</code></pre></div></div>
<p>We use the <code class="language-plaintext highlighter-rouge">F1Score</code> metric to get a reliable picture of model performance because our pollen dataset is imbalanced: there are far more images labeled <code class="language-plaintext highlighter-rouge">No pollen</code> than <code class="language-plaintext highlighter-rouge">Pollen</code>.</p>
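<p>To see why accuracy alone is misleading here, consider a degenerate classifier that always predicts the majority class. A small sketch (the class counts 846 vs. 271 are taken from the validation classification report below):</p>

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Validation class counts: 846 "No pollen" (0) vs 271 "Pollen" (1).
y_true = np.array([0] * 846 + [1] * 271)
y_majority = np.zeros_like(y_true)           # always predict "No pollen"
print((y_majority == y_true).mean())         # accuracy ~0.757: looks decent
print(f1_score(y_true, y_majority))          # F1 = 0.0: exposes the failure
```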

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">history</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="nf">fit</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">validation_data</span><span class="o">=</span><span class="n">valid_dataset</span><span class="p">)</span>
<span class="n">history_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="nc">DataFrame</span><span class="p">(</span><span class="n">history</span><span class="p">.</span><span class="n">history</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">history</span><span class="p">.</span><span class="n">epoch</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Epoch 1/20
140/140 [==============================] - 12s 67ms/step - loss: 0.5654 - accuracy: 0.9096 - f1_score: 0.8008 - val_loss: 0.6154 - val_accuracy: 0.8317 - val_f1_score: 0.4689
Epoch 2/20
140/140 [==============================] - 9s 63ms/step - loss: 0.0517 - accuracy: 0.9839 - f1_score: 0.9656 - val_loss: 0.5985 - val_accuracy: 0.8335 - val_f1_score: 0.4775
Epoch 3/20
140/140 [==============================] - 9s 62ms/step - loss: 0.0241 - accuracy: 0.9915 - f1_score: 0.9819 - val_loss: 0.3709 - val_accuracy: 0.9042 - val_f1_score: 0.7540
Epoch 4/20
140/140 [==============================] - 9s 63ms/step - loss: 0.0071 - accuracy: 0.9987 - f1_score: 0.9972 - val_loss: 0.3563 - val_accuracy: 0.9141 - val_f1_score: 0.7848
Epoch 5/20
140/140 [==============================] - 9s 63ms/step - loss: 0.0074 - accuracy: 0.9975 - f1_score: 0.9948 - val_loss: 0.3406 - val_accuracy: 0.9096 - val_f1_score: 0.7710
Epoch 6/20
140/140 [==============================] - 9s 61ms/step - loss: 0.0034 - accuracy: 0.9996 - f1_score: 0.9991 - val_loss: 0.4709 - val_accuracy: 0.8962 - val_f1_score: 0.7277
Epoch 7/20
140/140 [==============================] - 9s 62ms/step - loss: 8.3022e-04 - accuracy: 1.0000 - f1_score: 1.0000 - val_loss: 0.3459 - val_accuracy: 0.9194 - val_f1_score: 0.8009
Epoch 8/20
140/140 [==============================] - 9s 64ms/step - loss: 2.3191e-04 - accuracy: 1.0000 - f1_score: 1.0000 - val_loss: 0.2589 - val_accuracy: 0.9364 - val_f1_score: 0.8493
Epoch 9/20
140/140 [==============================] - 9s 62ms/step - loss: 1.4356e-04 - accuracy: 1.0000 - f1_score: 1.0000 - val_loss: 0.2349 - val_accuracy: 0.9409 - val_f1_score: 0.8613
Epoch 10/20
140/140 [==============================] - 9s 63ms/step - loss: 9.4333e-05 - accuracy: 1.0000 - f1_score: 1.0000 - val_loss: 0.1998 - val_accuracy: 0.9508 - val_f1_score: 0.8871
Epoch 11/20
140/140 [==============================] - 9s 62ms/step - loss: 8.5224e-05 - accuracy: 1.0000 - f1_score: 1.0000 - val_loss: 0.1852 - val_accuracy: 0.9552 - val_f1_score: 0.8984
Epoch 12/20
140/140 [==============================] - 9s 62ms/step - loss: 6.3893e-05 - accuracy: 1.0000 - f1_score: 1.0000 - val_loss: 0.1726 - val_accuracy: 0.9597 - val_f1_score: 0.9095
Epoch 13/20
140/140 [==============================] - 9s 62ms/step - loss: 5.8994e-05 - accuracy: 1.0000 - f1_score: 1.0000 - val_loss: 0.1611 - val_accuracy: 0.9624 - val_f1_score: 0.9160
Epoch 14/20
140/140 [==============================] - 9s 63ms/step - loss: 4.3215e-05 - accuracy: 1.0000 - f1_score: 1.0000 - val_loss: 0.1542 - val_accuracy: 0.9642 - val_f1_score: 0.9203
Epoch 15/20
140/140 [==============================] - 9s 63ms/step - loss: 5.1431e-05 - accuracy: 1.0000 - f1_score: 1.0000 - val_loss: 0.1408 - val_accuracy: 0.9678 - val_f1_score: 0.9289
Epoch 16/20
140/140 [==============================] - 9s 62ms/step - loss: 3.9965e-05 - accuracy: 1.0000 - f1_score: 1.0000 - val_loss: 0.1428 - val_accuracy: 0.9678 - val_f1_score: 0.9289
Epoch 17/20
140/140 [==============================] - 9s 63ms/step - loss: 3.5314e-05 - accuracy: 1.0000 - f1_score: 1.0000 - val_loss: 0.1373 - val_accuracy: 0.9687 - val_f1_score: 0.9310
Epoch 18/20
140/140 [==============================] - 9s 62ms/step - loss: 2.9370e-05 - accuracy: 1.0000 - f1_score: 1.0000 - val_loss: 0.1386 - val_accuracy: 0.9696 - val_f1_score: 0.9331
Epoch 19/20
140/140 [==============================] - 9s 63ms/step - loss: 2.4445e-05 - accuracy: 1.0000 - f1_score: 1.0000 - val_loss: 0.1319 - val_accuracy: 0.9696 - val_f1_score: 0.9331
Epoch 20/20
140/140 [==============================] - 9s 63ms/step - loss: 2.5461e-05 - accuracy: 1.0000 - f1_score: 1.0000 - val_loss: 0.1306 - val_accuracy: 0.9722 - val_f1_score: 0.9393
</code></pre></div></div>

<h3 id="check-training">Check Training</h3>
<p>Our model does not appear to be overfitting: both the training and validation loss curves decrease over time.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="p">.</span><span class="nf">plot</span><span class="p">(</span><span class="n">history_df</span><span class="p">[</span><span class="sh">"</span><span class="s">loss</span><span class="sh">"</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="sh">"</span><span class="s">loss</span><span class="sh">"</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="nf">plot</span><span class="p">(</span><span class="n">history_df</span><span class="p">[</span><span class="sh">"</span><span class="s">val_loss</span><span class="sh">"</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="sh">"</span><span class="s">val_loss</span><span class="sh">"</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="nf">legend</span><span class="p">();</span>
</code></pre></div></div>

<div class="row">
    <div class="col-sm mt-3 mt-md-0 text-center">
        <img class="img-fluid rounded z-depth-1" src="/assets/img/pollen_classification/output_15_0.png" alt="" title="Training and validation loss." />
    </div>
</div>
<div class="caption">
    Training and validation loss.
</div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">y_pred</span> <span class="o">=</span> <span class="p">[]</span>  <span class="c1"># store predicted labels
</span><span class="n">y_true</span> <span class="o">=</span> <span class="p">[]</span>  <span class="c1"># store true labels
</span><span class="n">X_valid</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># store the image
</span>
<span class="k">for</span> <span class="n">image_batch</span><span class="p">,</span> <span class="n">label_batch</span> <span class="ow">in</span> <span class="n">valid_dataset</span><span class="p">:</span>
    <span class="n">X_valid</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">image_batch</span><span class="p">)</span>
    
    <span class="n">y_true</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">label_batch</span><span class="p">)</span>
    <span class="c1"># compute predictions
</span>    <span class="n">preds</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="nf">predict</span><span class="p">(</span><span class="n">image_batch</span><span class="p">)</span>
    <span class="c1"># append predicted labels
</span>    <span class="n">y_pred</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">preds</span><span class="p">)</span>

<span class="c1"># convert the true and predicted labels into tensors
</span><span class="n">correct_labels</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="nf">concat</span><span class="p">([</span><span class="n">item</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">y_true</span><span class="p">],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">predicted_labels</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="nf">concat</span><span class="p">([</span><span class="n">item</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">y_pred</span><span class="p">],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">images</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="nf">concat</span><span class="p">([</span><span class="n">item</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">X_valid</span><span class="p">],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cm</span> <span class="o">=</span> <span class="nf">confusion_matrix</span><span class="p">(</span><span class="n">correct_labels</span><span class="p">,</span> <span class="n">predicted_labels</span> <span class="o">&gt;</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">normalize</span><span class="o">=</span><span class="sh">'</span><span class="s">all</span><span class="sh">'</span><span class="p">)</span>
<span class="nc">ConfusionMatrixDisplay</span><span class="p">(</span><span class="n">cm</span><span class="p">,</span> <span class="n">display_labels</span><span class="o">=</span><span class="p">[</span><span class="sh">"</span><span class="s">No Pollen</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">Pollen</span><span class="sh">"</span><span class="p">]).</span><span class="nf">plot</span><span class="p">()</span>
</code></pre></div></div>
<p>From the confusion matrix we can see that our model has no false positives. There are some false negatives, but overall the pollen model is very accurate. The matrix also confirms that our validation dataset is imbalanced: about 76% of the examples belong to the <code class="language-plaintext highlighter-rouge">No pollen</code> class.</p>

<div class="row">
    <div class="col-sm mt-3 mt-md-0 text-center">
        <img class="img-fluid rounded z-depth-1" src="/assets/img/pollen_classification/output_17_1.png" alt="" title="Confusion Matrix." />
    </div>
</div>
<div class="caption">
    Confusion Matrix.
</div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">print</span><span class="p">(</span><span class="nf">classification_report</span><span class="p">(</span><span class="n">correct_labels</span><span class="p">,</span> <span class="n">predicted_labels</span> <span class="o">&gt;</span> <span class="mf">0.5</span> <span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>              precision    recall  f1-score   support

         0.0       0.96      1.00      0.98       846
         1.0       1.00      0.89      0.94       271

    accuracy                           0.97      1117
   macro avg       0.98      0.94      0.96      1117
weighted avg       0.97      0.97      0.97      1117
</code></pre></div></div>
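<p>The support column also gives the class balance directly; the roughly 76% <code class="language-plaintext highlighter-rouge">No pollen</code> share quoted earlier is simply:</p>

```python
no_pollen_share = 846 / 1117  # support of class 0.0 over total support
print(round(no_pollen_share, 3))  # 0.757, i.e. about 76% of the validation set
```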

<h4 id="check-predictions">Check Predictions</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">random_idx</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="nf">permutation</span><span class="p">(</span><span class="nf">len</span><span class="p">(</span><span class="n">images</span><span class="p">))</span>
<span class="n">random_idx</span> <span class="o">=</span> <span class="n">random_idx</span><span class="p">[:</span><span class="mi">32</span><span class="p">]</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="nf">subplots</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">15</span><span class="p">))</span>
<span class="n">axes</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="nf">ravel</span><span class="p">()</span>

<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">idx</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">random_idx</span><span class="p">):</span>
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">imshow</span><span class="p">(</span><span class="n">images</span><span class="p">[</span><span class="n">idx</span><span class="p">])</span>
    <span class="n">true_label</span> <span class="o">=</span> <span class="sh">"</span><span class="s">Pollen</span><span class="sh">"</span> <span class="k">if</span> <span class="n">correct_labels</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mf">0.5</span> <span class="k">else</span> <span class="sh">"</span><span class="s">No Pollen</span><span class="sh">"</span>
    <span class="n">pred_label</span> <span class="o">=</span> <span class="sh">"</span><span class="s">Pollen</span><span class="sh">"</span> <span class="k">if</span> <span class="n">predicted_labels</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mf">0.5</span> <span class="k">else</span> <span class="sh">"</span><span class="s">No Pollen</span><span class="sh">"</span>
    
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">set_title</span><span class="p">(</span><span class="sh">"</span><span class="s">True: {}</span><span class="sh">"</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="n">true_label</span><span class="p">))</span>
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">set_xlabel</span><span class="p">(</span><span class="sh">"</span><span class="s">Pred: {}</span><span class="sh">"</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="n">pred_label</span><span class="p">))</span>
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">set_xticks</span><span class="p">([])</span>
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">set_yticks</span><span class="p">([])</span>
</code></pre></div></div>

<div class="row">
    <div class="col-sm mt-3 mt-md-0">
        <img class="img-fluid rounded z-depth-1" src="/assets/img/pollen_classification/output_21_0.png" alt="" title="Random examples." />
    </div>
</div>
<div class="caption">
    Random examples.
</div>

<h4 id="check-hard-cases">Check Hard Cases</h4>

<p>To find the hard cases we sort the squared errors in descending order and plot the 32 images with the largest error. Plotting them reveals the model's false negatives; some of these examples look difficult even for humans.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">errors</span> <span class="o">=</span> <span class="p">(</span><span class="n">correct_labels</span> <span class="o">-</span> <span class="n">predicted_labels</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span>
<span class="n">hard_cases_indxes</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="nf">argsort</span><span class="p">(</span><span class="n">errors</span><span class="p">,</span> <span class="n">direction</span><span class="o">=</span><span class="sh">"</span><span class="s">DESCENDING</span><span class="sh">"</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">hard_cases_indxes</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="nf">reshape</span><span class="p">(</span><span class="n">hard_cases_indxes</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="nf">subplots</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">15</span><span class="p">))</span>
<span class="n">axes</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="nf">ravel</span><span class="p">()</span>

<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">idx</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">hard_cases_indxes</span><span class="p">[:</span><span class="mi">32</span><span class="p">]):</span>
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">imshow</span><span class="p">(</span><span class="n">images</span><span class="p">[</span><span class="n">idx</span><span class="p">])</span>
    <span class="n">true_label</span> <span class="o">=</span> <span class="sh">"</span><span class="s">Pollen</span><span class="sh">"</span> <span class="k">if</span> <span class="n">correct_labels</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mf">0.5</span> <span class="k">else</span> <span class="sh">"</span><span class="s">No Pollen</span><span class="sh">"</span>
    <span class="n">pred_label</span> <span class="o">=</span> <span class="sh">"</span><span class="s">Pollen</span><span class="sh">"</span> <span class="k">if</span> <span class="n">predicted_labels</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mf">0.5</span> <span class="k">else</span> <span class="sh">"</span><span class="s">No Pollen</span><span class="sh">"</span>
    
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">set_title</span><span class="p">(</span><span class="sh">"</span><span class="s">True: {}</span><span class="sh">"</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="n">true_label</span><span class="p">))</span>
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">set_xlabel</span><span class="p">(</span><span class="sh">"</span><span class="s">Pred: {}</span><span class="sh">"</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="n">pred_label</span><span class="p">))</span>
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">set_xticks</span><span class="p">([])</span>
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">set_yticks</span><span class="p">([])</span>
</code></pre></div></div>

<div class="row">
    <div class="col-sm mt-3 mt-md-0">
        <img class="img-fluid rounded z-depth-1" src="/assets/img/pollen_classification/output_23_0.png" alt="" title="Hard cases examples." />
    </div>
</div>
<div class="caption">
    Hard cases examples.
</div>

<h3 id="conclusion">Conclusion</h3>

<p>We trained our pollen model using the TensorFlow/Keras framework and obtained a very accurate classifier: it produced no false positives on the validation dataset and only a few false negatives, some of which look hard even for humans.</p>]]></content><author><name></name></author><category term="plotbee" /><category term="machine-learning" /><category term="pollen" /><category term="classification" /><summary type="html"><![CDATA[This post shows how to train a CNN for pollen classification.]]></summary></entry><entry><title type="html">Characterizing Mapping Quality Recalibration Approaches in a Variant Graph Genomics Tool</title><link href="https://jachansantiago.com//blog/2018/vg-mapping-quality-recalibration/" rel="alternate" type="text/html" title="Characterizing Mapping Quality Recalibration Approaches in a Variant Graph Genomics Tool" /><published>2018-08-16T00:00:00-04:00</published><updated>2018-08-16T00:00:00-04:00</updated><id>https://jachansantiago.com//blog/2018/vg-mapping-quality-recalibration</id><content type="html" xml:base="https://jachansantiago.com//blog/2018/vg-mapping-quality-recalibration/"><![CDATA[<p>This was my research project during my summer internship with the BD2K program at the UC Santa Cruz Genomics Institute, under the supervision of Benedict Paten and Adam Novak. For more details, see the <a href="https://github.com/jachansantiago/vg_recal">GitHub page</a>.</p>

<h2 id="motivation">Motivation</h2>
<p>Identifying DNA patterns can tell us useful information about any living organism. Closely related organisms have similar DNA, while distantly related organisms share few similarities. Human genomes are extremely similar to one another; studying their differences can help identify particular variants that cause illness. Vg is a variant-graph-based alignment tool for DNA mapping that uses graph genome references; these graphs capture variation information from populations, which enables more accurate genome studies.</p>

<h2 id="vg"><a href="https://github.com/vgteam/vg#vg">vg</a></h2>

<p>Vg is a set of tools for working with genome variation graphs. These graphs consist of a set of nodes and edges, where each node represents a DNA sequence; an edge connects two nodes and can be seen as a concatenation of their two sequences. We build vg graphs from genome references and their sequence variations. Because of this variation there are multiple paths through the graph: from a particular sequence, there may be several edges to follow.</p>

<div class="row">
    <div class="col-sm mt-3 mt-md-0">
        <img class="img-fluid rounded" src="/assets/img/vg_recal/vg_graphic.png" alt="" title="Dataset examples." />
    </div>
</div>
<div class="caption">
    VG Graph.
</div>
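<p>As an illustration only (a toy structure, not vg's actual data model), a variation graph can be sketched as nodes holding sequence fragments plus edges that allow concatenation; each walk through the graph spells one possible sequence:</p>

```python
# Toy variation graph: a reference "ACGT...TTC" with one A/G variant site.
nodes = {1: "ACGT", 2: "A", 3: "G", 4: "TTC"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}

def spelled_sequences(node, path=()):
    """Yield the sequence spelled by every walk from `node` to a sink."""
    path = path + (node,)
    if not edges[node]:  # sink: emit the concatenated fragments
        yield "".join(nodes[n] for n in path)
    else:
        for nxt in edges[node]:
            yield from spelled_sequences(nxt, path)

print(list(spelled_sequences(1)))  # ['ACGTATTC', 'ACGTGTTC']
```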

<p>An essential part of vg is mapping DNA reads onto the graph, that is, searching for the position where the sequence is most similar to the reference graph. Mapping is challenging because genomes can be very repetitive, and each repetition can vary slightly. In addition to mapping, vg calculates a mapping quality score: an estimate of the probability that the mapping is wrong.</p>

<p><em>Graph image was created using <a href="https://github.com/vgteam/sequenceTubeMap">Sequence TubeMap Tool</a></em></p>
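<p>Mapping qualities are conventionally reported on the Phred scale (a SAM-format convention assumed here, not something stated above), so a wrongness probability maps to a score roughly as follows; the cap of 60 is a typical aligner choice, not vg-specific:</p>

```python
import math

def phred_mapq(p_wrong, cap=60):
    """Phred-scaled mapping quality: -10 * log10(P(mapping is wrong))."""
    if p_wrong <= 0:
        return cap  # certainty is capped rather than infinite
    return min(cap, round(-10 * math.log10(p_wrong)))

print(phred_mapq(0.01))   # 20: a 1-in-100 chance the mapping is wrong
print(phred_mapq(0.001))  # 30: a 1-in-1000 chance
```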

<h2 id="approach">Approach</h2>
<p>In this work, we create and benchmark models to predict the probabilities of mappings being wrong and compare our recalibration models against each other and against the original mapping quality scores. To build our dataset, we simulate sequences with errors from the reference graph and map these new sequences back into the graph, then label those mappings as correct or incorrect. We train our models to calculate when a mapping is wrong, then extract the probabilities from those predictions. Using these probabilities, we calculate mapping quality scores and compare them against the original scores calculated by vg using the Brier score.</p>

<div class="row">
    <div class="col-sm mt-3 mt-md-0">
        <img class="img-fluid rounded" src="/assets/img/vg_recal/vg_recal_workflow.png" alt="" title="Dataset examples." />
    </div>
</div>
<div class="caption">
    VG recalibration model training workflow.
</div>
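<p>The Brier score used for the comparison is just the mean squared error between predicted probabilities and the 0/1 outcomes; a minimal sketch with made-up values:</p>

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))

# 1 = mapping was wrong, 0 = mapping was correct (toy labels and probabilities)
print(brier_score([0, 0, 1, 0], [0.1, 0.2, 0.7, 0.0]))  # lower is better
```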

<h2 id="discussion">Discussion</h2>
<p>We tested five different logistic regression models using, respectively, mapping quality information, MEMs (maximal exact matches), sequences, MEM statistics, and a combination of MEMs and sequences. Our experiments show that logistic regression with MEMs improves the original vg mapping quality score by 5.23% on reads of length 100 base pairs, but does not generalize well across read lengths. Moreover, the Q-Q plot shows that the MEMs model is overconfident in its predictions.</p>
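<p>The overconfidence reading can be checked with a simple reliability computation (a generic calibration sketch, not the project's actual Q-Q plot code): bin the predicted probabilities and compare each bin's mean prediction with the observed frequency. An overconfident model predicts values more extreme than what is observed:</p>

```python
import numpy as np

def reliability_table(p_pred, y_true, n_bins=5):
    """Per-bin (mean predicted probability, observed frequency) pairs."""
    p_pred, y_true = np.asarray(p_pred, float), np.asarray(y_true, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p_pred >= lo) & ((p_pred < hi) | (hi == 1.0))
        if mask.any():
            rows.append((p_pred[mask].mean(), y_true[mask].mean()))
    return rows

# Confident 0.9 predictions that are right only half the time: overconfident.
print(reliability_table([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0], n_bins=2))
```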

<h2 id="references">References</h2>
<ul>
  <li>Garrison, Erik, et al. “Sequence variation aware genome references and read mapping with the variation graph toolkit.” bioRxiv (2017): 234856.</li>
</ul>]]></content><author><name></name></author><category term="vg" /><category term="mapping-quality" /><category term="machine-learning" /><category term="genomics" /><summary type="html"><![CDATA[This was my research project as part of my summer internship with the BD2K program at UC Santa Cruz Genomics Institute. I was under the supervision of Benedict Paten and Adam Novak.]]></summary></entry></feed>