
Application revenue prediction



Introduction

In the competitive landscape of the digital applications industry, maximizing revenue generation is a paramount objective. This exercise aims to develop a predictive model capable of forecasting application-generated revenue by leveraging a combination of user features and engagement metrics. By understanding user characteristics and their interactions with the application, we can gain valuable insights into revenue potential and optimize monetization strategies.

Data

The dataset provides information about users' application installation features, as well as engagement features collected over 120 days. It is divided into two parts:

  • The train data, which includes 1.7 million rows.
  • The test data, which includes 12.5 thousand rows.

Each row contains information about a user. The train dataset covers users who installed the application from February until the end of November, while the test dataset covers users who installed the application on December 1st.

Individual user features

This information is gathered on the first day the user installs the application.

Install date

This represents the day when the user installed the application. There is an increasing trend from February until December, with some periods of higher installations, such as early May and October. However, there are some outlier days where installations were extremely low, such as August 16th, September 22nd, and November 2nd, 4th, and 12th.

One can also notice that the install date has a weekly seasonality: more installs occur during the weekend.

Platform

In this dataset, applications are mainly installed on iOS devices.

Personalised ads

This represents whether the user opted in to personalized ads or other services. The train dataset has a higher number of users who did not opt in to personalized ads, while in the test dataset the split is roughly even.

App Id

Two applications have been installed, in both the train and the test set.

Country

This represents the country where the app was downloaded. Installations come mainly from the US market.

Ad Network

This represents the ID of the ad network that displayed the ads to the user. There are three main ad networks in both the train and the test set.

Campaign type

In marketing terms, this is the category of the ad campaign that was used to acquire the user.

Campaign id

There are more than 178 campaign_id values, some of which appear only once.

index  campaign_id  count
0      ...          559244
1      da2...       317648
2      99a...       203638
3      c6d...       114357
4      281...        70904
...    ...             ...
174    blo...            1
175    gam...            1
176    blo...            1
177    dow...            1
178    ぶろっ...           1

A common technique to deal with this behaviour is to group the less common categories together. This can be achieved with the following Python function.

def get_most_popular(series, threshold=None, top_values=None):
    """Group infrequent categories of a pandas Series under the label "other".

    Keeps categories with more than `threshold` occurrences, or the
    `top_values` most frequent categories; all others become "other".
    """
    counts = series.value_counts()
    if threshold is not None:
        popular_idx = counts[counts > threshold].index
    elif top_values is not None:
        popular_idx = counts.head(top_values).index
    else:
        return series
    # Keep popular categories as-is, replace everything else with "other"
    return series.where(series.isin(popular_idx), "other")
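
For instance, the grouped counts shown below can be obtained with a call along these lines (the exact threshold value is an assumption, chosen to match the smallest count in the table):

# Keep campaigns with more than 7 occurrences; the rest become "other"
df['campaign_id'] = get_most_popular(df['campaign_id'], threshold=7)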

The threshold can be adjusted so that the number of retained categories is representative of the group.

index  campaign_id  count
0      ...          559244
1      da...        317648
2      99...        203638
3      c6...        114357
4      28...         70904
...    ...             ...
96     20...            11
97     ブロ...           11
98     MA...            10
99     xx...             9
100    17...             8

Model

The device model column also contains many distinct values, with IDs that are sometimes unique, so by applying the same grouping function one can keep the top 100 models. Consistent with the platform distribution, iPhones and iPads appear among the top devices. One should notice that the IPhoneUnkown and IPadUnknown categories are not in the test set.

Manufacturer

This represents the manufacturer of the user's device. The main manufacturer is Apple, followed by Samsung and Google. The distribution of manufacturers is the same for the train and the test datasets.

Mobile classification

This classification relates to the monetary value of the device. High-end devices, such as the latest iPhone, have the best classification, Tier 1, while the cheapest phones fall into Tier 5. Some devices have no classification.

City

The city where the user downloaded the app. The most popular cities are Tokyo, Chicago, and Houston. Some cities, such as Otemae, Tacoma, and Nishikicho, do not appear in the test dataset.

Other variables

Some variables, such as `game_type` and `user_id`, are not useful for revenue prediction since their values are all unique.

User engagement features

During the life of the application, the user might interact by clicking on ads or by buying things on the platform. These features are measured as cumulative values at days 0, 3, 7, 14, 30, 60, 90, and 120.

Revenues generated by the application

The main revenue is measured by the variable dX_rev, where X is the day at which it is measured. This is the variable to be predicted. The distribution is similar for the test and train datasets. This variable is treated in log form since it varies mostly between 0 and 1; 70.0% of users produce less than $1 of revenue.
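
As a minimal sketch of this transform, log1p maps the many zero-revenue users to zero, and its inverse, expm1, recovers dollar values (the example revenues are illustrative):

import numpy as np
import pandas as pd

# Illustrative revenues: most users generate less than $1
d120_rev = pd.Series([0.0, 0.15, 0.8, 12.5])

# log1p(x) = log(1 + x), so zero revenue stays at zero
y = np.log1p(d120_rev)

# expm1 inverts the transform back to dollars after prediction
assert np.allclose(np.expm1(y), d120_rev)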

This variable is decomposed into dX_iap_rev and dX_ad_rev, which are the revenues from in-app purchases and ads respectively. It is possible to notice that the total revenue is mainly driven by ads.

Correlation of values

The correlation matrix allows us to identify correlated features.

The following variables are correlated:

  • “iap_ads_rev_d0” and “iap_ads_count_d0”: the revenue from ads is correlated with the number of ads.
  • “iap_coins_count_d0” and “iap_count_d0”: the number of coins bought by the user is correlated with the number of items bought by the user.
  • “iap_coins_rev_d0”, “d0_iap_rev” and “d0_rev”: the revenue from coin purchases is correlated with the revenue from in-app purchases and the total revenue.

These correlated features can be removed before the modelling stage, as sketched below.
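
A minimal sketch of this step, assuming a 0.9 correlation cutoff (the exact value is not stated in the post) and the feature DataFrame df:

import numpy as np

# Absolute pairwise correlations, keeping only the upper triangle
corr = df.corr(numeric_only=True).abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair correlated above the cutoff
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)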

Evolution over time

The cumulative value changes over the days. However, the following figure shows that the shape of the revenue distribution remains the same. The distribution shrinks because less data is available at d120_rev.

Missing values

The objective of the problem is to predict revenue at day 120, but this value is not available for every row in the dataset, because we don't have information from the future. For instance, the test dataset is from December 1st, so we only have information for day zero.

Modelling

To predict revenue at day 120, a traditional approach can be employed using structured data as the training dataset and the d120_rev column as the target variable. This approach leverages the existing information in the dataset to forecast future revenue.

Feature selection

Based on variable exploration, the following modifications were made to user installation features:

  • Categorical values such as campaign_id, model, manufacturer, and city were consolidated by grouping less frequent categories together.
  • Utilizing the installation_date for each user, additional features were derived, including the number of installations per day, the day of the week, and the month.
  • To highlight the distinct nature of Apple devices, a binary column was introduced to indicate whether the installation was made on an Apple device.
X = (
    df
    .assign(
        # Group rare categories under "other" with the helper defined above
        campaign_id=lambda x: get_most_popular(x['campaign_id'], threshold=THRESHOLD_POPULAR_CAMPAING),
        model=lambda x: get_most_popular(x['model'], threshold=THRESHOLD_POPULAR_MODEL),
        manufacturer=lambda x: get_most_popular(x['manufacturer'], top_values=TOP_MANUFACTURER),
        city=lambda x: get_most_popular(x['city'], top_values=TOP_CITIES),
        is_optin=lambda x: x['is_optin'].replace({1: "optin", 0: "not_optin"}),
        # Features derived from the install date
        installations_perday=lambda x: x['install_date'].dt.strftime('%Y-%m-%d').to_frame().merge(installations_perday, how='left', left_on='install_date', right_index=True)[0],
        installation_day=lambda x: x['install_date'].dt.day_name(),
        installation_month=lambda x: x['install_date'].dt.month,
        # Binary flag highlighting Apple devices
        is_apple=lambda x: x['manufacturer'].str.contains('apple', flags=re.IGNORECASE, na=False).replace({True: "is_apple", False: "not_apple"}),
        # Empty strings become an explicit "unknown" category
        mobile_classification=lambda x: x['mobile_classification'].replace(r'^\s*$', 'unknown_mobile_classification', regex=True),
    )
    # Cast user columns to pandas categorical dtype
    .assign(**{col: lambda x, col=col: x[col].astype('category') for col in user_cols})
    .get(engagement_cols + user_cols + [target_col, 'dataset'])
)

To ensure that each class is treated equally and to accommodate the requirements of models that cannot handle categorical variables directly, certain columns were transformed using one-hot encoding. This technique converts categorical values into numerical representations, enabling the model to process them effectively.

one_hot_columns = ['is_optin', 'platform', 'app_id', 'country', 'ad_network_id', 'campaign_type', 'mobile_classification']
X = pd.get_dummies(data=X, prefix='OHE', prefix_sep='_',
                   columns=one_hot_columns,
                   drop_first=False,
                   dtype='int8')

While the numerical values were left intact, normalization might be considered to improve model performance. Correlated features were removed based on the exploratory data analysis. To ensure realistic predictions at day zero, information from later-day features (d3, d7, d14, …) was excluded from the training dataset, as sketched below. This prevents the model from over-relying on these features and improves its ability to make accurate predictions when such information is unavailable.
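
A hedged sketch of this exclusion, assuming the engagement columns follow the dX_ prefix convention and that the target column is d120_rev:

# Day-zero features are kept; everything measured later is dropped
future_days = ('d3_', 'd7_', 'd14_', 'd30_', 'd60_', 'd90_', 'd120_')
future_cols = [c for c in X.columns
               if c.startswith(future_days) and c != 'd120_rev']
X = X.drop(columns=future_cols)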

Train test split

The dataset was randomly divided into training, testing, and validation sets, with proportions of 60%, 20%, and 20%, respectively. The training and validation sets were used during the model training phase to optimize hyperparameters and evaluate performance. The testing set was reserved for final model evaluation and comparison with other models.

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the final test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# 25% of the remaining 80% gives a 60/20/20 train/validation/test split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)

Model selection

Gradient boosting models have demonstrated exceptional performance in similar Kaggle competitions, making them well-suited for this task. I selected three popular gradient boosting models: XGBoost, CatBoost, and LightGBM. To predict the revenue value, I employed the regression version of these models. The RMSE metric was chosen to monitor the models’ performance throughout the training process.
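
For reference, with $n$ predictions $\hat{y}_i$ and true values $y_i$ (here in log space), the metric is:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$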

The following figure illustrates the evolution of the loss function for both the training and validation datasets. Training was halted when the validation loss stopped decreasing to prevent overfitting.

In order to find the best parameters of the model, I used the open-source Python library Hyperopt, which performs hyperparameter optimization for machine learning models. It allows you to define the search space for your parameters and the optimization strategy; Hyperopt then tries a number of configurations to find the best possible set of parameters for your model.

In the context of gradient boosting models, some of the parameters you might want to tune using Hyperopt include:

  • The number of leaves and the maximum depth of the decision trees.
  • The min_child_weight parameter, which stops the algorithm from splitting a node further when the sum of instance weights in that node falls below a threshold (for squared-error regression, this equals the number of samples).
  • The learning rate and the number of estimators used by the model.
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

space = {
    'enable_categorical': True,
    'learning_rate': hp.quniform('learning_rate', 0.05, 0.1, 0.05),
    'max_depth': hp.quniform('max_depth', 3, 18, 1),
    'max_leaves': hp.quniform('max_leaves', 1, 9, 1),
    'min_child_weight': hp.quniform('min_child_weight', 0, 10, 1),
    'n_estimators': 2000,
    'early_stopping_rounds': 5,
    'device': 'cuda:0',
    'seed': 0,
}


from sklearn.metrics import mean_absolute_error, root_mean_squared_error
from xgboost import XGBRegressor


def objective(space):
    clf = XGBRegressor(
        enable_categorical=True,
        learning_rate=space['learning_rate'],
        max_depth=int(space['max_depth']),
        max_leaves=int(space['max_leaves']),
        min_child_weight=int(space['min_child_weight']),
        n_estimators=int(space['n_estimators']),
        early_stopping_rounds=int(space['early_stopping_rounds']),
        device='cuda:0',
        seed=0,
    )

    # Early stopping monitors the validation loss during training
    clf.fit(X_train, y_train,
            eval_set=[(X_train, y_train), (X_val, y_val)],
            verbose=False)

    # Evaluate on the held-out test set
    pred = clf.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    rmse = root_mean_squared_error(y_test, pred)
    print(f"mae: {mae}, rmse: {rmse}")
    return {'loss': rmse, 'status': STATUS_OK}


trials = Trials()
best_hyperparams = fmin(fn=objective,
                        space=space,
                        algo=tpe.suggest,
                        max_evals=100,
                        trials=trials)
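
Before ensembling, the best configuration can be refit; note that fmin returns raw floats, so the integer-valued parameters must be cast back (a sketch, since the refit step is left implicit in the post):

# Merge the tuned values over the fixed ones and restore integer types
best = {**space, **best_hyperparams}
for key in ('max_depth', 'max_leaves', 'min_child_weight'):
    best[key] = int(best[key])

final_model = XGBRegressor(**best)
final_model.fit(X_train, y_train,
                eval_set=[(X_val, y_val)],
                verbose=False)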

The predictions from the three models have been combined using a weighted average.
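
A minimal sketch of the blend, assuming fitted regressors xgb_model, lgbm_model, and cat_model; the weights are illustrative, as the post does not state the values used:

# Weighted average of the three regressors' predictions
weights = {'xgb': 0.4, 'lgbm': 0.3, 'cat': 0.3}
final_pred = (weights['xgb'] * xgb_model.predict(X_test)
              + weights['lgbm'] * lgbm_model.predict(X_test)
              + weights['cat'] * cat_model.predict(X_test))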

Results

One of the advantages of gradient boosting models is their interpretability. These models can identify the most influential variables in making predictions. The following figure presents the feature importance for the XGBoost model. The installations per day feature is the most significant, providing valuable context about the day the user installs the application. Another important variable is the city where the user downloaded the app, as it implies information about the user’s demographics. Engagement variables, such as total revenue from ads, are also crucial for the final decision.

The RMSE of the model on the test dataset is 0.748. Since the target variable was transformed using a logarithmic function, this RMSE corresponds to a prediction error of approximately $1.11, which represents the typical error per prediction. The following figure illustrates the distribution of the actual target values in the test dataset and the predicted values generated by the model. The model tends to predict values between 0.1 and 1, and struggles to accurately capture the behavior of the target variable in the range between 0.001 and 0.01.
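
Assuming the log1p transform sketched earlier, the dollar figure follows from inverting the log-space RMSE:

import numpy as np

np.expm1(0.748)  # ≈ 1.11: the log-space RMSE mapped back to dollars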

Conclusion

This study aimed to develop a model to predict user-generated revenue within a digital application. By leveraging user installation features and engagement metrics, the model can provide valuable insights for optimizing monetization strategies.

The analysis revealed several key factors influencing revenue generation. Features like installation day, user city, and total ad revenue were identified as the most impactful on the model’s predictions.

The XGBoost model achieved an RMSE of 0.748 on the test dataset, which translates to a predicted error of approximately $1.11 for most user cases. However, the model exhibits limitations in accurately predicting low-revenue users (target values between 0.001 and 0.01).

Overall, this study demonstrates the potential of gradient boosting models for user revenue prediction within digital applications. By incorporating additional features and refining the model further, one can potentially improve prediction accuracy and gain deeper insights into user behavior that can be harnessed for effective revenue generation strategies.
