Spotify is a widely used music service, and its playlist data is publicly available on the internet. There are several popular trending playlists that reflect current music preferences. By utilizing GitHub Actions, you can automatically fetch this data at regular intervals and store it without needing to manage a database. This data can then be easily accessed using straightforward HTTP requests directly to GitHub. I use the spotify-downloader project to download the playlist data and save it as a file. Here is the code snippet that does it:
import subprocess

# CWD holds the repository working directory (defined elsewhere in the script)
def download(key, url):
    cmd = f"docker run --rm -v {CWD}/tmpplaylists:/music spotdl/spotify-downloader save {url.strip()} --save-file {key}.spotdl"
    p = subprocess.Popen(cmd.split(" "),
                         stderr=subprocess.STDOUT,
                         stdout=subprocess.PIPE)
    # stream the container output line by line
    for line in iter(p.stdout.readline, b''):
        print(f">>> {line.rstrip().decode('utf-8')}")
Then I parse every playlist file and do simple preprocessing on the artists column in order to obtain a list of every artist that participates in each song.
import glob

import pandas as pd

def read_data():
    appended_data = []
    cols = ['name', 'artists', 'album_name', 'date', 'song_id', 'cover_url', 'playlist', 'position']
    for f in glob.glob('tmpplaylists/*.spotdl'):
        data = pd.read_json(f).assign(
            artists=lambda x: x['artists'].explode().str.replace("'", "").str.replace("\"", "").reset_index().groupby('index').agg({'artists': lambda y: y.tolist()}),
            playlist=f.split("/")[1].split(".")[0],
            position=lambda x: x.index + 1
        )
        assert len(set(cols).difference(data.columns)) == 0, f'Columns: {", ".join(data.columns)}'
        assert len(data) > 0, f"Shape {data.shape[0]} and {data.shape[1]} columns"
        appended_data.append(data)
    (
        pd.concat(appended_data, ignore_index=True)
        .get(cols)
        .to_csv('static/data/data.csv', index=False, header=True, sep=";")
    )
By employing periodic GitHub Actions, it is possible to regularly save playlist positions every week, enabling further processing of this data through other tools.
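As an illustration of that further processing (this is not the dashboard code itself, and the snapshot file names are hypothetical), a pandas sketch could diff two weekly snapshots of data.csv to compute the position changes that the dashboard later displays:
import pandas as pd

def position_changes(old_csv: str, new_csv: str) -> pd.DataFrame:
    old = pd.read_csv(old_csv, sep=";")
    new = pd.read_csv(new_csv, sep=";")
    merged = new.merge(
        old[["song_id", "playlist", "position"]],
        on=["song_id", "playlist"], how="outer",
        suffixes=("", "_old"), indicator=True,
    )
    # "+" for new songs, "-" for removed ones, otherwise the position delta
    merged["attribute"] = merged.apply(
        lambda r: "+" if r["_merge"] == "left_only"
        else "-" if r["_merge"] == "right_only"
        else int(r["position_old"] - r["position"]),
        axis=1,
    )
    return merged.drop(columns="_merge")

# Example usage with two hypothetical weekly snapshots
# position_changes("snapshots/2024-01-01.csv", "snapshots/2024-01-08.csv")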
I utilize the Observable framework, which incorporates the D3 JavaScript library for generating fast and flexible visualizations.
Observable Notebook combines the features of conventional text editors, code editors, and document processors into a unified interface, simplifying the creation of rich and dynamic documents that integrate text, code, data visualization, and other multimedia elements.
Observable employs the concept of “cells” to arrange content within a notebook, where each cell can either contain plain text or executable code written in JavaScript or any other supported language. Cells can be rearranged, grouped, and nested, enabling the creation of hierarchical structures that reflect the logical organization of the document.
One can write a markdown notebook and import data produced by multiple languages; for example, I use a Python preprocessing pipeline, then import the resulting data in the notebook and plot it using the available visualization functions.
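One way to wire a Python step into an Observable Framework project is a data loader: a script whose standard output becomes the file the page loads. The file name and selected columns below are only an illustration of the idea, assuming the CSV produced by the pipeline above:
# Hypothetical data loader, e.g. src/data/playlists.csv.py; the page can then
# load the result with FileAttachment("data/playlists.csv").
import sys

import pandas as pd

df = pd.read_csv("static/data/data.csv", sep=";")
df[["name", "artists", "playlist", "position", "date"]].to_csv(sys.stdout, index=False)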
# Playlist details
const commit_date_old = Array.from(new Set(diffData.map(i => i.commit_date)))[1];
const commit_date_recent = Array.from(new Set(diffData.map(i => i.commit_date)))[0];
From ${commit_date_old} to ${commit_date_recent} new songs have been added to the playlist.
const playlistsNames = bestArtists.map(i => i.playlist)
const playlistChoosen = view(Inputs.select(new Set(playlistsNames), {value: playlistsNames[0], label: "Playlists"}));
const artistsNames = bestArtists.map(i => i.artists)
const tableRows = RecentSongAdds(diffData, playlistChoosen, commit_date_old, commit_date_recent)
<div class="card" style="margin: 1rem 0 2rem 0; padding: 0;">
${Inputs.table(tableRows, {
columns: ["position", "artists", "name", "album_name", "attribute"],
align: {"position": "left"},
format: {
attribute: (x) => x == "+" ? "New!" : x == "-" ? "🗑" : x > 0 ? `⬆${x}` : x == 0 ? '--' : `⬇${Math.abs(x)}`
}
})}
</div>
<div class="grid grid-cols-1" style="grid-auto-rows: 560px;">
<div class="card">
${BestArtistsPlot(bestArtists, playlistChoosen)}
</div>
</div>
const mostPopularArtists = view(Inputs.select(mostFrequent(bestArtists.filter(i => i.playlist == playlistChoosen).map(i => i.artists)).slice(0,10), {value: artistsNames[0], label: "Popular artists"}));
<div class="grid grid-cols-1" style="grid-auto-rows: 560px;">
<div class="card">
${BestSongsPlot(bestArtists, playlistChoosen, mostPopularArtists)}
</div>
</div>
The dashboard is hosted on GitHub Pages and is available at cristianpb.github.io/playlists.
The dashboard allows for the identification of patterns in the development of Spotify playlists over time. The Today's Top Hits playlist reflects global music trends, having garnered more than 34 million likes at the time of writing this article.
We can observe artists such as Olivia Rodrigo, who has multiple tracks featured in the “Today’s Top Hits” playlist. Some songs exhibit a consistent pattern, indicating that they have maintained popularity and catchiness over time, for example, “The Vampire Song,” which remained among the top 35 songs for more than four months. Conversely, other tracks like “Catch Me Now” may initially appear in the playlist due to the artist’s popularity but subsequently decline in ranking during subsequent weeks.
One might also observe that artist-specific radio playlists, which are frequently updated, exhibit minimal fluctuations. For instance, “Muse Radio,” “Coldplay Radio,” and “The Strokes” playlists undergo infrequent changes.
Observable is a practical platform for crafting data analyses, offering versatile connectors and support for multiple programming languages. The variety of available visualizations is crucial, and comprehensive documentation plays a significant role in guiding users to create effective visualizations.
However, incorporating reactive filters or reusing variables within an Observable notebook necessitates writing JavaScript code, which may be a drawback for some users. Although the reactivity of Observable notebooks is functional, it might not be the most advanced option available.
The code to process the data and build the dashboard is available at github.com/cristianpb/playlists.
One way to analyze logs is to use a tool like Fluent Bit to collect them from different sources and send them to a central repository like Elasticsearch. Elasticsearch is a distributed search and analytics engine that can store and search large amounts of data quickly and efficiently.
Once the logs are stored in Elasticsearch, you can use Kibana to visualize and analyze them. Kibana provides a variety of tools for exploring and understanding log data, including charts, tables, and dashboards.
By analyzing logs using Fluent Bit, Elasticsearch, and Kibana, you can gain valuable insights into the health and performance of your applications and systems. This information can help you to identify and troubleshoot problems, improve performance, and ensure the availability of your applications.
Traefik, a modern reverse proxy and load balancer, generates access logs for every HTTP request. These logs can be stored as plain text files and compressed using the logrotation Unix utility. Fluent Bit, a lightweight log collector, provides a simple way to insert logs into Elasticsearch. In fact, it provides several input connectors for other sources, such as syslog logs, and output connectors, such as Datadog or New Relic.
To send Traefik access logs to Elasticsearch using Fluent Bit, you need a configuration file describing where to read the logs, how to enrich them, and where to send them. The Fluent Bit configuration has the following sections: a tail input connector to fetch data from the access.log file, geoip2 and lua filters to enrich each record, and an es output connector pointing to Elasticsearch. The following configuration file shows how to collect Traefik logs and send them to Elasticsearch:
# fluentbit.conf
[SERVICE]
    flush 5
    daemon off
    http_server off
    log_level info
    parsers_file parsers.conf

[INPUT]
    name tail
    path /var/log/traefik/access.log,/var/log/traefik/access.log.1
    Parser traefik
    Skip_Long_Lines On

[FILTER]
    Name geoip2
    Match *
    Database /fluent-bit/etc/GeoLite2-City.mmdb
    Lookup_key host
    Record country host %{country.names.en}
    Record isocode host %{country.iso_code}
    Record latitude host %{location.latitude}
    Record longitude host %{location.longitude}

[FILTER]
    Name lua
    Match *
    Script /fluent-bit/etc/geopoint.lua
    call geohash_gen

[OUTPUT]
    Name es
    Match *
    Host esurl.com
    Port 443
    HTTP_User username
    HTTP_Passwd password
    tls On
    tls.verify On
    Logstash_Format On
    Replace_Dots On
    Retry_Limit False
    Suppress_Type_Name On
    Logstash_DateFormat all
    Generate_ID On
I use an additional Lua filter function to produce a geohash record, which is then used in Kibana for geo map plots.
# geopoint.lua
function geohash_gen(tag, timestamp, record)
    new_record = record
    lat = record["latitude"]
    lon = record["longitude"]
    hash = lat .. "," .. lon
    new_record["geohash"] = hash
    return 1, timestamp, new_record
end
The parser uses a regular expression to extract the different fields of each record. By default all fields are processed as strings, but you can specify other types, such as integer for fields like the request size, the request duration and the number of requests.
# parsers.conf
[PARSER]
    Name traefik
    Format regex
    Regex ^(?<host>[\S]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?<protocol>\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")? (?<number_requests>[^ ]*) "(?<router_name>[^\"]*)" "(?<router_url>[^\"]*)" (?<request_duration>[\d]*)ms$
    Time_Key time
    Time_Format %d/%b/%Y:%H:%M:%S %z
    Types request_duration:integer size:integer number_requests:integer
Once you have configured Fluent Bit, you can start it by running the following command: fluent-bit -c fluent-bit.conf
or by using docker compose:
# docker-compose.yml
version: "3.7"
services:
  fluent-bit:
    container_name: fluent-bit
    restart: unless-stopped
    image: fluent/fluent-bit
    volumes:
      - ./parsers.conf:/fluent-bit/etc/parsers.conf
      - ./fluentbit.conf:/fluent-bit/etc/fluent-bit.conf
      - ./geopoint.lua:/fluent-bit/etc/geopoint.lua
      - ./GeoLite2-City.mmdb:/fluent-bit/etc/GeoLite2-City.mmdb
      - /var/log/traefik:/var/log/traefik
Elasticsearch is a popular open-source search and analytics engine that can be used for a variety of tasks, including log analysis. It is a good choice for log analysis because it supports complex queries and provides a REST API that accepts queries written directly in readable JSON.
Elasticsearch uses a distributed architecture, which means that it can be scaled to handle large amounts of data. It also supports a variety of data types, including text, numbers, and dates, which makes it a versatile tool for log analysis.
To use Elasticsearch for log analysis, you would first need to index the logs into Elasticsearch. This can be done using a variety of tools, such as Logstash or Fluent Bit. Once the logs are indexed, you can then query them using Elasticsearch’s powerful query language.
Elasticsearch’s query language is based on JSON, which makes it easy to read and write. It also supports a variety of features, such as full-text search, regular expressions, and aggregations.
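As an illustration, here is a hedged sketch of how such a query could be sent from Python with the requests library; the hostname and credentials are placeholders, and the aggregation simply counts the last day's requests per country using the fields produced by the Fluent Bit filters above:
import requests

ES_URL = "https://esurl.com/logstash-*/_search"   # placeholder host and index pattern

query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-24h"}}},
    "aggs": {"per_country": {"terms": {"field": "country.keyword", "size": 10}}},
}
resp = requests.post(ES_URL, json=query, auth=("username", "password"), timeout=10)
for bucket in resp.json()["aggregations"]["per_country"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])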
Elasticsearch creates a mapping for new indices by default, guessing the type of each field. However, it is better to provide an explicit mapping for the index. This allows you to control the type of each field and the operations that can be performed on it. For example, you can specify that a field is of type ip so that it can be used to filter for IP address groups, or you can specify that a field is of type geo_point so that it can be used to filter by a specific location.
curl -XPUT "https://hostname/logstash-all" -H 'Content-Type: application/json' -d '{ "mappings": { "properties": { "@timestamp": { "type": "date" }, "agent": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "code": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "country": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "geohash": { "type": "geo_point" }, "host": { "type": "ip" }, "isocode": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "latitude": { "type": "float" }, "longitude": { "type": "float" }, "method": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "number_requests": { "type": "long" }, "path": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "protocol": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "referer": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "request_duration": { "type": "long" }, "router_name": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "router_url": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "size": { "type": "long" }, "user": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } } }'
Bonsai.io is a managed Elasticsearch service that provides high availability and scalability without the need to manage or deploy the underlying infrastructure. Bonsai offers a variety of plans to suit different project requirements.
The hobbyist tier is more than enough for this kind of use case; it comes with a maximum of 35k documents, 125 MB of data and 10 shards. At the time of writing this article it is free, and you don't have to enter a credit card to use it.
In order to be compliant with the limits of the hobby tier, I use the following cronjob to remove old documents:
curl -X POST "https://hostname/logstash-all/_delete_by_query" -H 'Content-Type: application/json' -d '{ "query": { "bool": { "filter": [ { "range": { "@timestamp": { "lt": "now-10d" } } } ] } } }'
For my use case, 10 days retention is enough to be compliant with the plan limits.
Bonsai.io also provides a managed Kibana service connected to the Elasticsearch cluster.
There are certain limitations in stack management: for example, there is no index lifecycle management and no alerting capability.
Nevertheless, it provides basic functionality to create useful dashboards and discover patterns inside the logs.
It's interesting to see bot requests trying to exploit vulnerabilities in services like WordPress, as well as bot scraping activity.
The following stack provides a simple and cost-effective way to analyze logs. The computational footprint on your server is very low because most of the infrastructure is in the cloud. There are many freemium services, such as Bonsai.io and New Relic, that can be used to ingest and analyze logs.
Observability is important for infrastructure management, but it is also important to have alerting capabilities to detect and respond to threats. Unfortunately, these plugins are not typically included in the free plan, so you will need to upgrade to a paid plan to get them.
I recently received a QuickFeather microcontroller from a Hackster.IO contest. One of the main features of this device is its built-in eFPGA, which can optimize parallel computations on the edge.
This post will explore the capabilities of this little beast and show how to run a machine learning model trained with TensorFlow. The use case is gesture recognition: the device will detect whether a movement corresponds to a given alphabet letter.
The QuickFeather is a very powerful device with a small form factor (58mm x 22mm). It's the first FPGA-enabled microcontroller to be fully supported with Zephyr RTOS. Additionally it includes an MC3635 accelerometer, a pressure sensor, a microphone and an integrated Li-Po battery charger.
Unlike other development kits which are based on proprietary hardware and software tools, QuickFeather is based on open source hardware and is built around 100% open source software. QuickLogic provides a nice SDK to flash some FreeRTOS software and get started. There is a bunch of documentation and examples in their github repository.
Since the QuickFeather is optimized for battery-saving use cases, it includes neither Wi-Fi nor Bluetooth connectivity. Therefore, data can only be transferred over a UART serial connection.
The on-board accelerometer is the main sensor for this use case. I use a USB-serial converter in order to read data directly from the accelerometer and transfer it to another host connected to the other end of the USB cable.
Data is captured and analysed on that other machine. I personally connected a Raspberry Pi, which also has a small form factor, in order to have flexibility when performing the different gestures.
SensiML provides a web application to visualize and save data. It is a Python application that runs a Flask web server and provides nice functionalities such as capturing video at the same time in order to correlate it with the saved data. The code is available on GitHub, so one can see how it works and even propose some modifications, like I did.
I captured data from O, W and Z gestures as you can see in the following picture:
Once data is collected, it needs to be labelled so that a machine learning model can be taught how to associate a certain movement with a gesture. I used Label Studio, which is an open source data labelling tool. It can be used to label different kinds of data such as images, audio, text, time series and combinations of all of the preceding.
It can be deployed on-premise using a docker image, which is very handy if you want to go fast.
Once Label Studio starts, it has to be configured for a labelling task. In this case, the task is over time series data. One can choose a graphical configuration using preconfigured templates, or customize it with an HTML-like configuration. Here is the code I use to configure the data coming from the X, Y and Z accelerometer channels.
<View>
<!-- Control tag for labels -->
<TimeSeriesLabels name="label" toName="ts">
<Label value="O" background="red"/>
<Label value="Z" background="green"/>
<Label value="W" background="blue"/>
</TimeSeriesLabels>
<!-- Object tag for time series data source -->
<TimeSeries name="ts" valueType="url" value="$timeseriesUrl" sep="," >
<Channel column="AccelerometerX" strokeColor="#1f77b4" legend="AccelerometerX"/>
<Channel column="AccelerometerY" strokeColor="#ff7f0e" legend="AccelerometerY"/>
<Channel column="AccelerometerZ" strokeColor="#111111" legend="AccelerometerZ"/>
</TimeSeries>
</View>
Label Studio has a nice preview feature, which shows how the labelling task will look with the supplied configuration. The following screenshot shows how the interface looks during the setup process.
One of the nicest things about Label Studio is the fact that one can go really fast using the keyboard shortcuts. It also provides some machine learning plugins which make predictions with the partially labelled data. The following screenshot shows the interface for some labelled data.
From a machine learning perspective, the exported data should be a CSV file with four columns. Even though Label Studio is able to export to CSV, the output didn't have the right format for my use case; instead it looks like the following:
timeseriesUrl,id,label,annotator,annotation_id
/data/upload/W.csv,3,"[{""start"": 156, ""end"": 422, ""instant"": false, ""timeserieslabels"": [""W""]}, ... ]",admin@admin.com,3
/data/upload/Z.csv,2,"[{""start"": 141, ""end"": 419, ""instant"": false, ""timeserieslabels"": [""Z""]}, ...]",admin@admin.com,2
/data/upload/O.csv,1,"[{""start"": 77, ""end"": 389, ""instant"": false, ""timeserieslabels"": [""O""]}, ...]",admin@admin.com,1
So I decided to export labels in json format and then build a python script to transform and combine them all. The following script transforms three json files from Label Studio into a single file with 4 columns AccelerometerX, AccelerometerY, AccelerometerZ and Label.
import numpy as np
import pandas as pd

df_all = pd.DataFrame()
LABELS = ['W', 'Z', 'O']
sensor_columns = ['AccelerometerX', 'AccelerometerY', 'AccelerometerZ', 'Label']
for ind, label in enumerate(LABELS):
    df = pd.read_csv(f'{label}/{label}.csv')
    events = pd.DataFrame(pd.read_json('WOZ.json')['label'][ind])
    df['Label'] = 0
    for k, v in events.iterrows():
        for i in range(v['start'], v['end']):
            df.loc[i, 'Label'] = v['timeserieslabels'][0]
    df['LabelNumerical'] = pd.Categorical(df.Label)
    df[sensor_columns].to_csv(f'{label}/{label}_label.csv', index=False)
    df_all = pd.concat([df_all, df], sort=False)
df_all[sensor_columns].to_csv('WOZ_label.csv', index=False)
The resulting data can be directly used as a time series data and a machine learning model can be trained in order to recognise the patterns automatically. The following picture shows data for W, O and Z patterns.
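As a quick sanity check (an illustrative snippet, assuming the WOZ_label.csv produced by the script above), the labelled data can be inspected and plotted with pandas:
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("WOZ_label.csv")
print(df["Label"].value_counts())  # number of samples per gesture (0 = no gesture)
df[["AccelerometerX", "AccelerometerY", "AccelerometerZ"]].plot(subplots=True, figsize=(12, 6))
plt.show()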
SensiML provides a Python package to build a data pipeline which can be used to train a machine learning model. One needs to create a free account in order to use it. There is a lot of documentation and examples available online.
Pipelines are a key component of the SensiML workflow. Pipelines store the preprocessing, feature extraction, and model building steps.
Model training can be done either on SensiML cloud or by using TensorFlow to train the model locally and then uploading it to SensiML in order to obtain the firmware code to run on the embedded device.
In order to train the model locally, one needs to build a data pipeline to process the data and calculate the feature vectors. Here is the Python code for the pipeline:
dsk.pipeline.reset()
dsk.pipeline.set_input_data('wand_10_movements.csv', group_columns=['Label'], label_column='Label', data_columns=sensor_columns)
dsk.pipeline.add_segmenter("Windowing", params={"window_size": 350, "delta": 25, "train_delta": 25, "return_segment_index": False})
dsk.pipeline.add_feature_generator(
[
{'subtype_call': 'Statistical'},
{'subtype_call': 'Shape'},
{'subtype_call': 'Column Fusion'},
{'subtype_call': 'Area'},
{'subtype_call': 'Rate of Change'},
],
function_defaults={'columns': sensor_columns},
)
dsk.pipeline.add_feature_selector([{'name':'Tree-based Selection', 'params':{"number_of_features":12}},])
dsk.pipeline.add_transform("Min Max Scale") # Scale the features to 1-byte
I use the TensorFlow Keras API to create a neural network. The model is very simple because not all TensorFlow functions and layers are available in the microcontroller version. I use a fully connected network to efficiently classify the gestures. It takes as input the feature vectors created previously with the pipeline (12 features).
from tensorflow.keras import layers
import tensorflow as tf
tf_model = tf.keras.Sequential()
tf_model.add(layers.Dense(12, activation='relu',kernel_regularizer='l1', input_shape=(x_train.shape[1],)))
tf_model.add(layers.Dropout(0.1))
tf_model.add(layers.Dense(8, activation='relu', input_shape=(x_train.shape[1],)))
tf_model.add(layers.Dropout(0.1))
tf_model.add(layers.Dense(y_train.shape[1], activation='softmax'))
# Compile the model using a standard optimizer and loss function for classification
tf_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
The training is performed by feeding the neural network with the dataset in batches. For each batch, a loss function is computed and the weights of the network are adjusted. Each full pass through the entire training set is called an epoch.
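For reference, a training call could look like the following; the number of epochs, the batch size and the validation split are illustrative values, not the ones used for the final model:
history = tf_model.fit(
    x_train, y_train,
    epochs=100,            # illustrative hyperparameters
    batch_size=32,
    validation_split=0.2,  # hold out part of the data to monitor overfitting
)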
The confusion matrix provides information not only about the accuracy but also about the kind of errors of the model. It’s often the best way to understand which classes are difficult to distinguish.
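With a held-out test set (x_test and y_test are assumed here, one-hot encoded like the training data), the confusion matrix can be computed with scikit-learn:
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.argmax(y_test, axis=1)                    # one-hot -> class index
y_pred = np.argmax(tf_model.predict(x_test), axis=1)
print(confusion_matrix(y_true, y_pred))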
Once you are satisfied with the model results, it can be optimized using Tensorflow quantize function. The quantization reduces the model size by converting the network weights from 4-byte floating point values to 1-byte unsigned int8. Tensorflow provides the following built-in tool:
# Quantized Model
converter = tf.lite.TFLiteConverter.from_keras_model(tf_model)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
converter.representative_dataset = representative_dataset_generator
tflite_model_quant = converter.convert()
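The representative_dataset_generator referenced above is a small function that yields a few batches of real feature vectors so the converter can calibrate the quantization ranges. A minimal sketch, assuming x_train holds the feature vectors from the pipeline, could be:
import numpy as np

def representative_dataset_generator():
    # yield a subset of real samples, one at a time, as float32 batches
    for sample in x_train[:100]:
        yield [np.expand_dims(sample, axis=0).astype(np.float32)]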
Quantization brings additional benefits on Cortex-M processors like the one in the QuickFeather, which provide instructions that give integer arithmetic a boost in performance.
The quantized model can be uploaded to SensiML in order to obtain firmware to flash to the QuickFeather. One can download the resulting knowledge pack using the Jupyter notebook widget or from the SensiML cloud application, and it is available in two formats.
The knowledge pack can be customized in order to light the QuickFeather LED with a different colour depending on the prediction made. This can be done by adding the following function to the src/sml_output.c file.
// src/sml_output.c
static intptr_t last_output;

uint32_t sml_output_results(uint16_t model, uint16_t classification)
{
    //kb_get_feature_vector(model, recent_fv_result.feature_vector, &recent_fv_result.fv_len);
    /* LIMIT output to 100hz */
    if( last_output == 0 ){
        last_output = ql_lw_timer_start();
    }
    if( ql_lw_timer_is_expired( last_output, 10 ) ){
        last_output = ql_lw_timer_start();
        if ((int)classification == 1) {
            HAL_GPIO_Write(4, 1);
        } else {
            HAL_GPIO_Write(4, 0);
        }
        if ((int)classification == 2) {
            HAL_GPIO_Write(5, 1);
        } else {
            HAL_GPIO_Write(5, 0);
        }
        if ((int)classification == 3) {
            HAL_GPIO_Write(6, 1);
        } else {
            HAL_GPIO_Write(6, 0);
        }
        sml_output_serial(model, classification);
    }
    return 0;
}
Finally the model can be compiled using Qorc SDK and flashed again to the QuickFeather.
One can use a Li-Po battery with the battery connector of the QuickFeather in order to have complete autonomy. Then using a nice spoon like the following one can improvise a magic wand 🪄:
The following video shows the recognition system in action; the colours mean the following:
QuickFeather is a device completely adapted for tiny machine learning models. This use case provides a simple example to demystify the whole workflow for implementing machine learning algorithms on microcontrollers, but it can be extended to more complex use cases, like the one proposed in the Hackster.io Climate Change Challenge.
SensiML provides nice tools to simplify machine learning implementation for microcontrollers. They provide software like Data Capture Lab, which captures data and also provides a labelling module. However, for this case I preferred to use Label Studio, which is a more generic tool that works for most use cases.
The notebook with the complete details about the model training can be found in this gist.
Vimwiki makes it easy to create a personal wiki using the Vim text editor. A wiki is a collection of text documents linked together and formatted with a plain text syntax that can be highlighted for readability using Vim's syntax highlighting feature.
The plain text notes can be exported to HTML, which improves readability. In addition, it’s possible to connect external HTML converters like Jekyll or Hugo.
In this post I will show the main functionalities of Vimwiki and how to connect the Hugo fast markdown static generator.
Vimwiki notes writing in Vim
Markdown notes converted into HTML
With Vimwiki you can organize notes and ideas, maintain a diary, and export everything to HTML.
One of Vim's main advantages is the fact that it's a modal editor, which means that it has different editing modes. Each mode gives different functionality to each key, which increases the number of shortcuts without requiring multiple-key combinations. In Vimwiki this allows you to write notes with ease.
When I want to write some notes, I just open Vim and then I use
<Leader>w<Leader>w
to create a new note for today with a name based on the
current date. <Leader>
is a key that can be configured in Vim; in my case it's the comma character (,).
If I want to look at my notes I can use <Leader>ww
to open the wiki index
file. I can use the Enter key to follow links in the index. Backspace acts as a return
to the previous page.
I use CoC snippets to improve autocompletion. In markdown, I find this plugin very useful to create tables, code blocks and links. You can use snippets for almost every programming language, just take a look at the documentation.
When I want to preview the markdown file, I use <Leader>wh
to convert the current
wiki page to HTML, and I also added a shortcut to open the HTML in the browser.
In the following video you can see an example of this workflow in action.
Your browser doesn't support HTML5 video.
One of the advantages of digital notes is that you can quickly search across multiple files using queries.
Vimwiki comes with a VimwikiSearch command (VWS), which is simply a wrapper around the Unix grep command. It can search for patterns in case-insensitive mode across all your notes.
An excellent way to implement labels and contexts for cross-correlating information is to assign tags to headlines. If you add tags to your Vimwiki notes, you can also use the VimwikiSearchTags command.
In both cases, the search results populate the location list, where you can move using the Vim commands lopen to open the list, lnext to go to the next occurrence and lprevious for the previous one.
Vimwiki has a custom filetype called wiki, which is a little bit different from markdown. The native Vimwiki2HTML command only works for the wiki filetype. If you want to transform files to HTML using other filetypes, like markdown, you have to use a custom parser. Even if I'm not able to use Vimwiki's native parser, I prefer the markdown format because it's very popular and simple.
There are several options to use as an external markdown parser.
I started using static website generators because they can also easily publish the notes as static webpages, which I wanted to publish on GitHub Pages.
My first option was Jekyll, which is the static website generator natively supported by GitHub. It's easy to use and the syntax is very straightforward, but I started to regret it when I accumulated a lot of notes. Then I decided to use Hugo, which is claimed to be faster and, since it's written in Go, has no dependencies. In the following table I show my compile-time results for both:
| Compiling time | Jekyll | Hugo |
|---|---|---|
| 10 pages | 0.1 seconds | 0.1 seconds |
| 150 pages | 48 seconds | 0.5 seconds |
I should say that I used the Jekyll GitHub gems, which include some unnecessary Ruby gems, so I think Jekyll's performance could be improved. It's a nice piece of software that I use to publish this post, but Hugo is still faster.
The .vimrc file contains the Vim configuration and it's the place where one can define the Vimwiki syntax and writing directory. As you can see in my configuration, I use markdown syntax and save my files under ~/Documents/vimwiki/.
" ~/.vimrc
let g:vimwiki_list = [{
\ 'auto_export': 1,
\ 'automatic_nested_syntaxes':1,
\ 'path_html': '$HOME/Documents/vimwiki/_site',
\ 'path': '$HOME/Documents/vimwiki/content',
\ 'template_path': '$HOME/Documents/vimwiki/templates/',
\ 'syntax': 'markdown',
\ 'ext':'.md',
\ 'template_default':'markdown',
\ 'custom_wiki2html': '$HOME/.dotfiles/wiki2html.sh',
\ 'template_ext':'.html'
\}]
The custom wiki2html file corresponds to a script which is executed to transform markdown into HTML. This script calls the Hugo executable and tells Hugo to use the Vimwiki file path as a baseurl in order to maintain link dependencies.
# ~/.dotfiles/wiki2html.sh
env HUGO_baseURL="file:///home/${USER}/Documents/vimwiki/_site/" \
hugo --themesDir ~/Documents/ -t vimwiki \
--config ~/Documents/vimwiki/config.toml \
--contentDir ~/Documents/vimwiki/content \
-d ~/Documents/vimwiki/_site --quiet > /dev/null
The complete version of my ~/.vimrc
can be found in my dotfiles repository.
Hugo projects can easily be published to GitHub using GitHub Actions. The following workflow tells the GitHub runner to build the HTML with Hugo at each push and publish the files to GitHub Pages.
name: 🚀 Publish Github Pages
on: push
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Git checkout
        uses: actions/checkout@v2
      - name: Setup hugo
        uses: peaceiris/actions-hugo@v2
        with:
          hugo-version: 'latest'
      - name: Build
        run: hugo --config config-gh.toml
      - name: 🚀 Deploy to GitHub pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          deploy_key: ${{ secrets.ACTIONS_DEPLOY_KEY }}
          publish_branch: gh-pages
          publish_dir: ./public
          force_orphan: true
I like having one part of my notes published on GitHub Pages, at least the configuration notes, which can be found on my GitHub page. But there is also a part of my notes that I keep private, for example my diary notes, where I may have some sensitive information, so I keep them away from publication just by adding them to the .gitignore file. Here you can find my GitHub notes repository.
I like the fact that Hugo has no dependencies since it's written in Go, so it's very easy to install: just download it from the GitHub project's releases page. In addition, it's a blazingly fast static website generator; you can find benchmarks on the internet.
I have been using Vimwiki very often; it allows me to take notes very easily and also to find information about things that happened in the past. When people ask about last month's meeting, I can easily find what I wrote by searching by date, tag or just words.
Publishing my notes to GitHub gives me a place where I can keep track of my Vimwiki configuration and also publish simple notes that are not meant to be blog posts, like my Arch Linux install notes or my text editor configuration.
A microservice project typically includes multiple Docker containers, where each container provides a separate piece of functionality.
These containers can communicate in a private network and map ports with the external network in order to expose services.
However, not every service includes a security layer, so it's better to expose a single application that serves as a router, controlling every incoming request and sending it to the right service. This avoids exposing something like a whole database connection. Some applications that are able to act as such a router are Nginx, Apache HTTP Server, Caddy and Traefik.
In this article I will show how to set up Traefik using file-based configuration and also how to implement offline metric analysis using the GoAccess tool.
Traefik is a reverse proxy, which routes incoming requests to microservices. It has been conceived for environments with multiple microservices: a main configuration sets up Traefik, which then dynamically detects new services coming from Docker, Kubernetes, Rancher or a plain file system. More information about Traefik automatic discovery is available here.
This automatic discovery behaviour was the main thing that attracted me to Traefik, unlike Nginx, which refuses to start if a declared service is not available. Traefik, on the other hand, can run even if a declared service isn't running, and when the container starts it is automatically detected by Traefik.
Traefik has a modern web interface to graphically inspect the configuration. It shows information about entrypoints, routers, middlewares and services.
The Traefik interface can easily be enabled in the configuration file. The following lines tell Traefik to serve the interface on the Traefik entrypoint (8080 by default). The debug option is useful for profiling performance and debugging Traefik.
api:
  insecure: true
  dashboard: true
  debug: true
Here is a screenshot of the web interface, where one can see how a service is configured.
In the following gist you can find the complete configuration file for Traefik. The basic parameters to define are the entrypoints, where Traefik should be listening, and the encryption method. The providers configuration can be done in another plain file, or by adding labels to Docker, Kubernetes, Rancher, etc. In any case, Traefik dynamically detects changes on providers.
This configuration can be done in a plain file if Traefik is running outside a Docker container, but it can also be done by setting labels on the Traefik Docker container.
By default Traefik will watch for all containers running on the Docker daemon,
and attempt to automatically configure routes and services for each container.
If you’d like to have more refined control, you can pass the
--providers.docker.exposedByDefault=false
option and selectively enable
routing for your containers by adding a traefik.enable=true
label.
Regarding HTTPS security, SSL connections can easily be configured in Traefik: one can use a self-signed certificate or connect automatically to Let's Encrypt in order to get an SSL certificate. Renewal is also handled by Traefik. HTTPS redirection is also available among the Traefik parameters.
Traefik has been conceived to run as a Docker container, but since it's written in Go, it's possible to run the compiled binary as a standalone file on several operating systems.
In the Docker version, Traefik runs automatically when the container is powered on and the logs go to standard output. However, if you run the standalone binary, you have to configure Traefik as a system service. I used the excellent information from this Gerald Pape gist to configure the Traefik service.
I prefer the standalone version in development environments like the Raspberry Pi or Jetson Nano, where building Docker images can take a while.
GoAccess is a simple tool to analyse logs. It provides fast and valuable HTTP statistics for system administrators who require a visual server report on the fly. It can generate reports in terminal format, which is nice if you are connected over SSH, but it can also generate CSV, JSON or HTML reports.
An alternative to this service is Matomo, which has the advantage of being self-hostable and open source, so you can be sure about how your collected data is being used and that it is not being sold to third parties and advertisers. However, Matomo requires an extra client-side JavaScript library in order to collect data, which is another dependency that I don't want for internal offline environments.
Another popular alternative is Google Analytics, which has very powerful reports and a multitude of options that go beyond the scope of this article. The only problem is that it's not privacy compliant.
What makes GoAccess interesting is that it generates detailed analytics based purely on the access logs from a web server, such as Apache, Nginx or, in my case, Traefik. It's written in C and features both a terminal interface and a web interface. The way it's designed to be used is by piping the access.log contents into the GoAccess binary and providing any number of switches to customize the output, such as which log format you're sending it and how to resolve geolocation from IP addresses.
In the following image you can see an example for GoAccess HTML dashboard. On the top there is global information about the number of total requests, the number of unique visitors, the log size, the bandwidth, etc.
GoAccess can be called from the command line; you can configure the log format using a command-line parameter or a configuration file. The default configuration file can be found at /etc/goaccess.conf, but you can also pass another configuration file using the --config-file option.
The default output goes to the terminal, but one can produce an HTML report by specifying an output file. This creates a static HTML report, which can be continuously updated using the --real-time-html option.
The following code shows the systemctl file that I use to configure GoAccess as a service for real time use.
[Unit]
Description=Goaccess Web log report.
After=network.target
[Service]
Type=simple
User=root
Group=root
Restart=always
ExecStart=/usr/bin/goaccess -a -g -f /var/log/traefik/access.log -o /var/www/html/report.html --real-time-html
StandardOutput=null
StandardError=null
[Install]
WantedBy=multi-user.target
GoAccess doesn't include a static web server, so it cannot expose the produced HTML by itself. But one can easily configure an Nginx static server to expose the static files, as shown in the following Nginx virtual server:
server {
    listen 8082;
    listen [::]:8082;
    server_name localhost;

    gzip on;
    gzip_types text/plain application/xml image/jpeg;
    gzip_proxied no-cache no-store private expired auth;
    gzip_min_length 1000;

    root /var/www/html;
    # Add index.php to the list if you are using PHP
    index index.html index.htm index.nginx-debian.html;

    location / {
        try_files $uri $uri/ =404;
    }
}
Traefik is a reverse proxy which is well adapted to dynamic configurations. Even if it's still a young project and not as performant as Nginx, it has an interesting approach and some nice features. For example, with Docker applications it automatically knows the internal IP address of a service to which it should redirect incoming requests.
GoAccess is a very good tool to get insights from logs in a closed environment where you cannot share your stats with the outside. Since it is written in C, read performance is very good: it is able to parse 400 million hits in 1 hour and 20 minutes, according to the GoAccess FAQ.
The historical web interface of Mopidy is MusicBox, which started being developed in 2013. It has the basic functionalities of Mopidy and it's compatible with several web browsers, but the problem is that it doesn't include support to control Snapcast sources.
There are more recent web interfaces like Iris, with a gorgeous design and support for Snapcast control, but it collects utilization analytics for each client, which I prefer to avoid.
So I decided to develop my own web interface using state-of-the-art web technologies, which are described in this post.
I chose the Svelte framework as the core of the web interface. This framework is very light and I like the syntax because I feel that I'm only writing html, css and js but behind the scenes the code is completely optimized at build time.
I also chose Sapper (short for Svelte app maker) to be the glue between the different Svelte components. In the following sections I explain how I used Sapper and Svelte together.
Svelte framework is relatively new (released in 2016) but it’s becoming very popular because it’s very light and fast. This is mainly due to the fact that it optimizes the code to manipulate the DOM directly using vanilla JavaScript, which is different from traditional frameworks like React or Vue, which need a framework in the browser to manipulate the DOM.
One of the advantages of Svelte is its reactivity system. This system is inspired by the philosophy of Excel cells, which are linked to one another using formulas: when a value in a parent cell changes, all the linked cells are updated automatically. The same happens with reactive variables in Svelte; the variables are updated according to a dependency graph.
I use reactive variables to store information about the track that is currently playing, the songs present in the tracklist, the songs present in a playlist and the results of a search action.
Another advantage of Svelte is the fact that animations can easily be added to components. I added some drag-and-drop events to reorder the tracks in playlists, as you can see in the following video:
The order of the items in the tracklists is synchronized with the values in the backend, so when the next song button is pressed, the song that has been dragged is played.
Svelte has a Read–Eval–Print Loop (REPL) page, where people can share their snippets. I think it's a great way to test small things and share them with everybody. For instance, here is a REPL page for the algorithm that moves the items in the playlists.
I used the Bulma CSS framework because of its simplicity. I added the SCSS version as a global style used by some components. Once the design is stabilized, I can use local CSS in each component to optimize the application.
Sapper has been developed by the Svelte team, so it follows the same principles and simplicity of the Svelte family. However, this framework is still not as popular as Svelte, nor as mature.
I use Sapper as the routing application for the different pages.
Sapper is also in charge of preparing the server-side rendering mode, where part of the JavaScript code is executed on the server so that the browser can load the webpage faster.
The Mopidy core API runs on a backend which recently reached version 3.0. This means that it needs Python 3.7, which is not installed by default in all desktop environments. To avoid version compatibility issues and breaking internal dependencies, I also created a Dockerfile with support for Mopidy 3.0, which helps a lot for local development.
A Makefile is used to inject environment variables, like the location of the music and playlist folders. These can easily be tuned depending on the running environment.
Snapcast is a multiroom client-server audio player, where all clients are time synchronized with the server to play perfectly synced audio. It’s not a standalone player, but an extension that turns your existing audio player into a Sonos-like multi room solution.
I have been using it on three Raspberry Pis to have the same music in different rooms of the house. It works well with Mopidy and the initial configuration is very minimal: there is one master with the Mopidy instance and the Snapcast server, and a Snapcast client on each of the others.
Snapcast also provides a JSON-RPC control API to communicate with the server using websockets. This protocol provides a continuous link between the client and the server: messages are sent when there are changes or events, which allows the client to be notified when something changes on the server.
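As an illustration of that API (this is not the Muse code; the host, client id, port and method names follow the Snapcast control-protocol documentation as I recall it, so double-check them), a small Python client could query the server status and set a client volume over the TCP control port:
import json
import socket

def snapcast_request(method, params=None, host="raspi.local", port=1705):
    # Assumed framing: one JSON-RPC request per line, terminated by "\r\n"
    request = {"id": 1, "jsonrpc": "2.0", "method": method}
    if params:
        request["params"] = params
    with socket.create_connection((host, port)) as sock:
        sock.sendall((json.dumps(request) + "\r\n").encode())
        return json.loads(sock.makefile().readline())

status = snapcast_request("Server.GetStatus")
snapcast_request("Client.SetVolume",
                 {"id": "raspicam", "volume": {"percent": 40, "muted": False}})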
I implemented this communication in Muse, so that one can control the volume of the Snapcast clients. The following image shows the Snapcast control panel and the sound from three devices: raspi, raspimov and raspicam.
There are two main components of the Muse web client: the Svelte/Sapper frontend and the Mopidy Python extension that packages and serves it.
The final destination of the Python extension is the PyPI repository, where Python users can download and install Muse. Publishing can be done manually using tools like twine. However, there is always the risk of a human mistake, like publishing the wrong version, which is why I prefer to delegate this action to GitHub.
Here is a description of the deployment pipeline of the Muse package using GitHub Actions: the version is bumped in setup.cfg and package.json, and the package is then built and published to PyPI.
There is still plenty of room for improvement in Muse. I would like to add features like integrating third-party services such as Discogs or Genius, improving the user experience or improving the design. If you can think of other nice features to add, you can open an issue on the GitHub page of the project.
I'm also satisfied with having an automatic deployment workflow. I used different GitHub actions developed by others and also contributed to improving one of them.
The code of Muse is available in this GitHub repository and the Python package is available on PyPI; you can install it with sudo python -m pip install Mopidy-Muse and use it without restriction.
These restrictions are what keep AMP pages fast, but not having external JavaScript calls makes it really difficult to include dynamic functionality like fetching external content or calling API services. This is the case for a search bar, which needs to fetch some content and render the results. It gets more complicated for static web content hosted on services like Netlify, GitLab or GitHub Pages, since these options don't provide any backend and thus no dynamic content.
Even if it seems complicated, it's not impossible. One can adapt a website to the AMP format and still include dynamic functionality. For example, this post shows how to implement a JavaScript call in AMP pages and build a search bar which allows searching for content in a static website hosted on GitHub Pages.
There are various alternatives for integrating a search function into a static website.
I personally prefer the approach of SimpleJekyllSearch, which exposes an endpoint with the posts' information that can then be used in AMP components. The front matter informs Jekyll that this is a special page, so it takes the defined parameters into account when rendering it. The front matter defines a limit on the number of posts and a permalink, so that the output file is available at /api/github-pages.json.
Jekyll allows looping over the posts using the site.posts array and getting data from each post.
Here's what my endpoint file looks like:
---
limit: 100
permalink: /api/github-pages.json
---
[{% for post in site.posts limit: page.limit %}{"title": "{{ post.title }}", "description": "{{ post.description }}", "image": "{{ post.image.path }}", "thumb": "{{ post.thumb.path }}", "date": "{{ post.date | date: "%B %d, %Y" }}",{% if post.source %}"source": "{{ post.source }}",{% endif %}{% if post.link %}"link": "{{ post.link }}",{% endif %}{% if post.categories %}"categories": [{% for category in post.categories %}"{{ category }}"{% if forloop.last %}{% else %},{% endif %}{% endfor %}],{% endif %}{% if post.categories == nil %} "categories" : [], {% endif %} "url": "{{ post.url }}", {% if post.tags %}"tags": [{% for tag in post.tags %}"{{ tag }}"{% if forloop.last %}{% else %},{% endif %}{% endfor %}]{% endif %}{% if post.tags == nil %}"tags": []{% endif %}}{% unless forloop.last %},{% endunless %}{% endfor %}]
The content of the file uses the Liquid syntax to iterate over the posts. The result is an array that contains the following information for each post: title, description, image, thumbnail, date, source, link, categories, tags and url.
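As a quick sanity check of the generated endpoint (the URL below is this blog's published one; adapt it to your own site), the JSON can be fetched from Python and filtered with the same kind of substring matching the AMP page applies:
import requests

posts = requests.get("https://cristianpb.github.io/api/github-pages.json").json()
query = "amp"
hits = [p for p in posts
        if query in p["title"].lower()
        or query in p["description"].lower()
        or query in " ".join(p["tags"]).lower()]
print(len(hits), [p["title"] for p in hits[:5]])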
There are AMP components that can use this static API, such as amp-state, amp-bind and amp-list.
For example, if we define an HTML input element, it's possible to add a listener that takes the input value and filters the results using some JavaScript functions (not all ES6 features are allowed). The JavaScript indexOf function returns an int larger than -1 if the input element is found, so one can use it as a simple search like this:
allArticles.filter(a => a.title.indexOf(event.value) != -1)
For a more complex search, across multiple fields, the component looks like the following:
<amp-state id="allArticles"
src="/api/github-pages.json"></amp-state>
<input class="input" type="text" on="input-throttled:AMP.setState({filteredArticles: allArticles.filter(a => ((a.title.toLowerCase().indexOf(event.value.toLowerCase()) != -1) || (a.description.toLowerCase().indexOf(event.value.toLowerCase()) != -1) || (a.tags.join(' ').toLowerCase().indexOf(event.value.toLowerCase()) != -1) || (a.categories.join(' ').toLowerCase().indexOf(event.value.toLowerCase()) != -1) ) ), inputValue: event.value }), top-blog-page.scrollTo(duration=200)" placeholder="🔍 Search" [value]="inputValue">
AMP provides some components to process the JSON object that is bound with amp-bind: amp-list and amp-mustache.
The amp-list component loads the filteredArticles object, the height is controlled dynamically with the number of elements in the object.
By default amp-list expects an object with the format {"items": [obj1, obj2]}; it's possible to configure the location of the items list using the items parameter.
The Mustache syntax allows filtering the properties of the component using conditional notation: {{#name}} renders a block when the property exists, {{^name}} renders it when the property is missing, and {{/name}} closes the block.
Please note that static website generators like Jekyll need to escape Mustache templates using {% raw %} and {% endraw %} tags, otherwise the results would not be rendered. Here is what the final combination of the two elements looks like:
<amp-list height="500"
[height]="(400) * filteredArticles.length"
layout="fixed-height"
items="."
src="/api/github-pages.json"
[src]="filteredArticles"
binding="no">
<div placeholder>Loading ...</div>
<div fallback>Failed to load data.</div>
<template type="amp-mustache">
{% raw %}
<div class="box">
<div class="columns is-multiline">
<div class="column is-3-desktop is-12-mobile">
<a class="image" href="{{#link}}{{ link }}{{/link}}{{^link}}{{ url }}{{/link}}" data-proofer-ignore>
<amp-img
width="300" height="200"
src="{{ thumb }}"
srcset="{{ thumb }} 400w, {{ image }} 900w"
alt="{{ title }}" layout="responsive"></amp-img>
</a>
</div>
<div class="column is-9-desktop is-12-mobile">
<div class="content">
<a class="title is-6" href="{{#link}}{{ link }}{{/link}}{{^link}}{{ url }}{{/link}}" data-proofer-ignore>{{ title }}</a>
<p class="is-size-7">
{{ description }}
</p>
<div class="tags has-addons">
<span class="tag"><i class="fas fa-calendar-alt"></i> {{ date }}</span>
{{#tags}}<a class="tag" href="/blog/tags#{{.}}" data-proofer-ignore> {{.}}</a>{{/tags}}
{{#categories}}<a class="tag is-link" href="/blog/categories#{{.}}" data-proofer-ignore> {{.}}</a>{{/categories}}
{{#source}}
<span class="tag is-danger"><i class="fas fa-external-link-alt"></i> {{ source }}</span>
{{/source}}
</div>
</div>
</div>
</div>
</div>
{% endraw %}
</template>
</amp-list>
The final version of the search bar can be seen in the following video. The search bar filters the posts at every keystroke. An additional AMP event is called to place the window at the top of the results list, and the amp-list height is controlled by the number of posts shown.
The search bar is always visible using a position: fixed
css property.
The code can be found at the github repository of this page.
I have seen different opinions on the effect of using the AMP format. From my personal point of view, my SEO and page load speed have improved since I started using this format. The AMP rules impose good web practices that guarantee the best performance. The traffic data of my website can be found on this GitHub page.
AMP integrates smoothly with API endpoints defined in static website build tools like Jekyll. The endpoint with the contents of the blog is a small json, which can be filtered using simple javascript functions allowed in AMP.
I'm happy with the results since the JSON is still very small (less than 19 kB), and when I write more posts I can use more advanced amp-list functions to add pagination.
Arduino is a big player in the IoT field; their open source platform allows writing software for microcontrollers. These devices can communicate using standard protocols like WiFi, Bluetooth or Low Power Wide Area (LPWA) networks. LPWA protocols are adapted for applications that need to communicate in remote places where there is no WiFi available, and Sigfox is a great player in this field.
Sigfox partnered with Arduino to produce the Arduino MKR FOX 1200, which can connect easily to the Sigfox network. The Arduino MKR FOX 1200 is designed to run Arduino code, communicate over the Sigfox network and run on a battery for long periods of time.
In this article I will show how to set up the device, send messages to the Sigfox network and create a serverless architecture to display the collected messages.
Arduino MKRFOX1200 has been designed to offer a practical and cost effective solution for makers seeking to add SigFox connectivity to their projects with minimal previous experience in networking.
Its USB port can be used to supply power (5V) to the board. It has a screw connector to attach a 3V battery pack. The board consumes so little power that it runs on two 1.5V AA or AAA batteries for a really long time.
Its 8 I/O pins allow for different applications. For more information about the board, you can see Arduino's page.
The board can be programmed using C or C++. Arduino provides its own IDE to write the software, compile it and send it to the device.
For command-line lovers there are other options like PlatformIO or the ino package, but they involve dealing with complex Makefiles, so I still use the Arduino IDE for the sake of simplicity.
The first part is to get the device ID and PAC number. Sigfox provides an Arduino sketch file, which connects to the board and shows the ID and PAC number in the serial console.
This information is needed in the Sigfox backend to configure the device and receive the messages. Sigfox has very good documentation on how to register devices.
For the development environment, I connected the device from its micro USB input to my laptop's USB port. This allows me to power the device and use serial communication at the same time.
For the production environment, I power the device through the battery connector with two AA batteries. The manufacturer announces at least 6 months of operation, which I think might be possible using the low-power mode between messages.
The following image shows a Fritzing schema, which clearly indicates how to connect each wire. This is very useful when you want to recreate the project.
I used the sample code from Antoine de Chassey to get information from the DHT22 and send it to the Sigfox cloud.
Sigfox allows sending up to 140 messages per day, and each message can have a maximum length of 12 bytes. This is enough for cases where you only need to send small amounts of information, like measurement data.
As a reminder, 1 byte == 8 bits. So if we want to send 1 unsigned byte, it can hold values from 0 to 255. If we need a sign, the range is -128 to 127. For 2 bytes, the range is 0 to 65535 unsigned, or -32768 to 32767 signed.
In Antoine de Chassey’s code, this message is passed using the following C++ structure:
typedef struct __attribute__ ((packed)) sigfox_message {
int16_t moduleTemperature;
int16_t dhtTemperature;
uint16_t dhtHumidity;
uint8_t lastMessageStatus;
} SigfoxMessage;
This message contains the following elements: the module temperature, the DHT22 temperature, the DHT22 humidity and the status of the last message.
In the end we only use 7 bytes in each message. A smaller message reduces transmission time and increases battery life.
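For illustration, the same 7-byte little-endian layout can be reproduced in Python with the struct module; the raw values below are the ones encoded in the example payload shown later in this post:
import struct

# "<" = little-endian, "h" = int16, "H" = uint16, "B" = uint8
payload = struct.pack("<hhHB",
                      1911,    # moduleTemperature (raw value)
                      5079,    # dhtTemperature (raw value)
                      40036,   # dhtHumidity (raw value)
                      0)       # lastMessageStatus
print(len(payload), payload.hex())  # prints "7 7707d713649c00", well under the 12-byte limit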
The Sigfox API allows obtaining messages from all of the registered devices, within a maximum history of 3 days and a maximum of 100 messages. The endpoint corresponds to the route /devices/${DEVICE_ID}/messages, where DEVICE_ID is the id of the device. More information can be found in the Sigfox endpoint documentation.
I used the Axios library in JavaScript to make a GET request to the API:
const params = {}
if (limit) {
  params.limit = limit
}
if (start) {
  params.since = start
}
if (end) {
  params.before = end
}
const myurl = `https://api.sigfox.com/v2/devices/${DEVICE_ID}/messages`;
const resulting = await axios.get(myurl, {
  params: params,
  auth: {
    username: UNAME,
    password: UPASS
  }
});
The variables UNAME and UPASS correspond to authentication credentials, which can be obtained in your Sigfox account page, as described in the Sigfox API documentation. The parameters limit, since and before allow controlling the time range of the response.
The result from the API looks like 7707d713649c000078c40000, which corresponds to the message bytes encoded as a hexadecimal string. I used the npm sigfox-parser package to parse this output.
var parser = require('sigfox-parser')

function convertPayload(payload) {
  var parsed = parser(payload, format);
  var moduleTemp = (parsed.moduleTemp / INT16_t_MAX * 120).toFixed(2);
  var dhtTemp = (parsed.dhtTemp / INT16_t_MAX * 120).toFixed(2);
  var dhtHum = (parsed.dhtHum / UINT16_t_MAX * 110).toFixed(2);
  var heatIndex = computeHeatIndex(dhtTemp, dhtHum, false).toFixed(2);
  var result = {
    moduleTemp: moduleTemp,
    dhtTemp: dhtTemp,
    dhtHum: dhtHum,
    heatIndex: heatIndex
  }
  return result
}
The parser function takes two inputs: the hexadecimal payload and a format string describing each field:
var format = 'moduleTemp::int:16:little-endian dhtTemp::int:16:little-endian dhtHum::uint:16:little-endian lastMsg::uint:8';
Once the result has been parsed, it can be used as a JavaScript object. I used the ChartJS library to plot the result, as you can see in the demo page powered by a free Heroku instance. The server keeps the API credentials safe from the web client.
One of the biggest advantages of this serverless architecture is the fact that it doesn't involve a database, which reduces the complexity of the web application. The source code of the application can be found on this GitHub page.
The Arduino MKR FOX 1200 makes it easy to connect IoT devices to the Sigfox network and thus to the cloud. The device consumes so little power that it runs on two 1.5V AA or AAA batteries for a really long time; the manufacturer announces at least 6 months, which I'm currently testing. This kind of device is completely adapted to outdoor applications that don't have WiFi connectivity or electrical power.
Sigfox provides an API to retrieve the device messages, which simplifies the dashboard architecture.
I would like to thank Sigfox's team for organizing the Hackster.io Sigfox Universities Challenge 2019, where I got a top-10 prize and obtained an Arduino MKR FOX 1200.
I tried the device in a production environment with debug mode on (which means that the LED lights up every time a message is transmitted) and the battery lasted 3 months. I guess the LED reduces the expected life of 6 months.
With debug mode off (no LED light) the battery should last more than 4 months; I didn't get to know exactly how much longer it lasted because my Sigfox subscription year ended, so I could no longer use the network.
With debug mode off, I had some issues waking up the device after it entered power-saving mode. Fortunately, I found this useful post in the Arduino forum which solved the problem.