This article describes the key points of my participation in the 2021 edition of the World Data League. The Tech Moguls team, composed of me, Tiago Gonçalves, Tomé Albuquerque and Joana Morgado, from INESC TEC, finished in second place in this edition.
World Data League (WDL) is a Data Science competition where groups of Data Scientists work to solve social problems using data. There were several main topics: Public Transportation, Traffic, Cycling and Environment. Each one was broken down into 4 smaller sub-topics, giving rise to a two-week stage per sub-topic.
The finals were a 3-day event on Noise Pollution for the top 10 finalists. All the code, data and challenge descriptions are available on the official WDL GitLab. For more detail than is provided here, please see the notebooks linked below.
The way we think about problem-solving at NILG.AI allowed me to participate in this challenge. The whole Lean Data Science pipeline – creating a baseline solution, thinking about how the end user could use it, and calculating business metrics besides technical metrics – is very relevant when developing solutions that generate value. In this article, we show our way of reasoning.
Stage 1 – Public Transportation
In this stage, we covered churn models in public transportation. The dataset covered two periods of time (including the COVID-19 lockdown periods in Portugal) and contained the average number of bus users per day, aggregated by location, gender, and age group. The goal was to identify churn profiles and propose measures to reduce churn.
To do this, we trained a Decision Tree to predict the probability that a given segment would increase or decrease its usage of public transportation between the two periods, using the variables we considered relevant. The aim was to create groups that could be used to explain churn: the tree’s branches give us the definition of each segment and its size.
We discovered two segments with a high propensity for churning:
The first segment covers users from the South of Portugal who are in neither the 65+ nor the 25-34 age group
The second segment covers users aged 25-34 across the whole country
The most relevant variables that explained this decrease were:
Population Density in the County and District
Relative Change in Unemployment
Variability in demand, extracted from the Origin-Destination matrix, which can be a proxy for how easily people flow out of a county into the parishes
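As a rough illustration of how a tree turns labelled segments into churn groups, the sketch below marks a segment as churning when its average daily usage dropped between the two periods, then searches for the single "feature == value" split with the lowest weighted Gini impurity, which is the choice a decision tree makes at each node. The segment records, field names and numbers are invented for illustration, and the one-level split is a simplification: the actual notebooks used a full Decision Tree with more variables.

```python
from collections import Counter

# Hypothetical segments: average daily users in each period (made-up numbers).
segments = [
    {"region": "South", "age": "35-44", "before": 120.0, "after": 60.0},
    {"region": "South", "age": "45-54", "before": 90.0, "after": 50.0},
    {"region": "North", "age": "25-34", "before": 200.0, "after": 110.0},
    {"region": "North", "age": "65+", "before": 80.0, "after": 85.0},
    {"region": "Centre", "age": "65+", "before": 70.0, "after": 72.0},
    {"region": "Centre", "age": "35-44", "before": 150.0, "after": 100.0},
]


def churned(seg):
    """Target label: True when usage decreased between the two periods."""
    return seg["after"] < seg["before"]


def gini(rows):
    """Gini impurity of the churn label within a group of segments."""
    if not rows:
        return 0.0
    counts = Counter(churned(r) for r in rows)
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())


def best_split(rows, features):
    """Return (weighted impurity, feature, value) of the best equality split."""
    best = None
    for feat in features:
        for value in sorted({r[feat] for r in rows}):
            left = [r for r in rows if r[feat] == value]
            right = [r for r in rows if r[feat] != value]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, feat, value)
    return best


score, feature, value = best_split(segments, ["region", "age"])
print(f"best split: {feature} == {value!r} (weighted Gini {score:.2f})")
```

In this toy data the 65+ group is the only one that did not churn, so the split on that age group separates the labels perfectly. Repeating the search inside each branch yields progressively narrower segments, and the number of rows in a leaf gives the segment size.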
Stage 2 – Traffic
The plot below shows a typical example of the average traffic flow in the city of Porto. It decreases during the night and starts to increase around 04:00/05:00, when people start their working routine. It keeps increasing until 10:00/11:00, stays approximately stable until the end of working hours at 18:00/19:00, and decreases afterwards.
Our solution focused on forecasting traffic in the city of Porto 24 hours ahead. We used an XGBoost Classifier with weather features (current and forecast), historical intensity features (e.g. average intensity in the past), date features (whether it’s a holiday/weekend at prediction time, hour, week day, …) and sensor position features (distance to sensor centroid).
This could be used, for instance, to dynamically change traffic light frequency depending on the area and time of the day.
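To make the feature groups above more concrete, here is a minimal sketch of how one feature row for a 24-hour-ahead prediction could be assembled. The function name, the holiday set, the coordinates and the intensity values are all hypothetical, weather features are left out for brevity, and this is not the code from our notebooks.

```python
from datetime import datetime, timedelta
from math import hypot
from statistics import mean

# Hypothetical holiday calendar; a real one would come from an official source.
HOLIDAYS = {datetime(2021, 6, 10).date()}


def build_features(now, past_intensity, sensor_xy, centroid_xy, horizon_hours=24):
    """Build one feature row describing the moment `horizon_hours` after `now`."""
    target = now + timedelta(hours=horizon_hours)
    return {
        # Date features are computed at prediction time, not at the current time.
        "hour": target.hour,
        "weekday": target.weekday(),  # Monday = 0
        "is_weekend": target.weekday() >= 5,
        "is_holiday": target.date() in HOLIDAYS,
        # Historical intensity feature: average of past readings for this sensor.
        "mean_past_intensity": mean(past_intensity),
        # Sensor position feature: Euclidean distance to the sensor centroid.
        "dist_to_centroid": hypot(sensor_xy[0] - centroid_xy[0],
                                  sensor_xy[1] - centroid_xy[1]),
    }


row = build_features(
    now=datetime(2021, 6, 9, 8, 0),
    past_intensity=[100, 120, 140],
    sensor_xy=(3.0, 4.0),
    centroid_xy=(0.0, 0.0),
)
print(row)
```

Rows like this one, built per sensor and timestamp, would then be passed to the classifier together with the weather features.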
This stage had another outcome – the acceptance of a paper for SoGood 2021 – The 6th Workshop on Data Science for Social Good (Paulo Maia, Joana Morgado, Tiago Gonçalves and Tomé Albuquerque – Applying Machine Learning for Traffic Forecasting in Porto, Portugal)!
Stage 3 – Cycling
In the third stage, we worked on the challenge “(Literally) paving the way towards safer cities”. We had access to Google Street Maps images of Lisbon, taken at four different angles (covering 0 to 360º), and the goal was to estimate a score of perceived road safety based on the objects in each image.
We labeled a subset of images with the following classes:
Irrelevant view: whenever the street is not fully visible or the image is just pointing at a wall;
Street width: whether a single car or more than one car could fit in the street;
Pavement type: cobblestone (paralelo), tar (alcatrão) or dirt (terra batida);
Pavement quality: low, mid or high.
A pre-trained car detection model was also used to count the number of cars in each image, which gave us traffic intensity as a proxy for danger.
As an example, here’s a subset of images that were labeled as irrelevant:
Afterwards, we associated a risk score with the presence of each of these classes and averaged the scores over the angles of each location, which could be used to create a street-level risk map.
The groups of 4 images below show low risk and high risk scores, respectively, based on our established rules.
Images with the lowest risk scores have pavement type “alcatrão” (tar), high pavement quality (no visible cracks) and no cars present.
Images with the highest risk scores have pavement type “paralelo” and cars present.
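The rule-based scoring described above can be sketched as follows. The risk weights, the dictionary keys and the choice to skip irrelevant views when averaging are assumptions made for illustration; the article does not list the actual values we used.

```python
from statistics import mean

# Hypothetical risk contribution of each label (the real rules are in the notebooks).
PAVEMENT_TYPE_RISK = {"alcatrao": 0.0, "paralelo": 1.0, "terra batida": 2.0}
PAVEMENT_QUALITY_RISK = {"high": 0.0, "mid": 1.0, "low": 2.0}
NARROW_STREET_RISK = 1.0
RISK_PER_CAR = 0.5


def image_risk(labels):
    """Rule-based risk for one image; None for views labelled as irrelevant."""
    if labels["irrelevant"]:
        return None
    return (PAVEMENT_TYPE_RISK[labels["pavement_type"]]
            + PAVEMENT_QUALITY_RISK[labels["pavement_quality"]]
            + (NARROW_STREET_RISK if labels["narrow"] else 0.0)
            + RISK_PER_CAR * labels["n_cars"])


def location_risk(images):
    """Average risk over the relevant angles of one location."""
    scores = [s for s in map(image_risk, images) if s is not None]
    return mean(scores) if scores else None


# Four angles of one hypothetical location.
location = [
    {"irrelevant": False, "pavement_type": "alcatrao", "pavement_quality": "high",
     "narrow": False, "n_cars": 0},
    {"irrelevant": False, "pavement_type": "paralelo", "pavement_quality": "mid",
     "narrow": True, "n_cars": 2},
    {"irrelevant": True},
    {"irrelevant": False, "pavement_type": "alcatrao", "pavement_quality": "low",
     "narrow": False, "n_cars": 0},
]
print(location_risk(location))
```

Computing this value for every location and plotting it on a map would give the street-level risk map mentioned above.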
Stage 4 – Environment
For the fourth stage, we worked on the challenge “Optimisation of outdoor advertisements in cities”. Cities are flooded by countless outdoor advertising panels, often poorly distributed. Visual aspects are crucial in the urban planning process, since each planning choice can obstruct urban elements and thus produce adverse effects on the city’s image.
The dataset contained the coordinates of several billboards, as well as the average number of visitors.
Our approach assumed that billboards could only be moved between existing locations: a billboard removed from one location had to be placed in another location already present in the data, as we do not know which other coordinates are valid locations for billboards.
We developed a metaheuristics-based algorithm (local/neighborhood search) that reduces the outdoor-billboard density (i.e., the number of outdoor billboards in a given radius) while increasing the total number of views. We create neighbor solutions through swap operations, in which we change the coordinates of a given billboard, and assess the impact on our fitness function, which takes both quantities into account.
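A minimal sketch of this kind of neighborhood search is shown below. The candidate locations, the radius, the penalty weight and the exact form of the fitness function are assumptions for illustration, and the search is a deterministic best-improvement variant rather than the exact algorithm from our notebooks.

```python
from itertools import combinations
from math import dist

# Hypothetical existing locations: (x, y, average visitors). Made-up values.
LOCATIONS = [
    (0.0, 0.0, 100), (0.1, 0.0, 90), (0.0, 0.1, 80),
    (5.0, 5.0, 70), (9.0, 1.0, 60),
]


def fitness(occupied, radius=1.0, crowding_penalty=50):
    """Total views minus a penalty for every pair of billboards closer than `radius`."""
    views = sum(LOCATIONS[i][2] for i in occupied)
    close_pairs = sum(
        1 for a, b in combinations(sorted(occupied), 2)
        if dist(LOCATIONS[a][:2], LOCATIONS[b][:2]) < radius
    )
    return views - crowding_penalty * close_pairs


def local_search(occupied):
    """Best-improvement local search over single-billboard moves."""
    current = frozenset(occupied)
    while True:
        free = set(range(len(LOCATIONS))) - current
        # A neighbor moves one billboard from an occupied to a free existing location.
        neighbours = [current - {i} | {j} for i in current for j in free]
        best = max(neighbours, key=fitness, default=current)
        if fitness(best) <= fitness(current):
            return current  # no neighbor improves the fitness: local optimum
        current = best


solution = local_search({0, 1, 2})
print(sorted(solution), fitness(solution))
```

Starting from three billboards clustered together, the search spreads them out to locations that are further apart, trading a few views for a lower density. A randomised variant would sample neighbors instead of enumerating them all, which scales better to a full city.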
Finals – Noise Pollution
The provided data covered noise sensors in the city of Torino, Italy, as well as points of interest and police complaints.
We developed an explainable XGBoost Classifier that predicts the probability of noise levels in a neighbourhood exceeding the legal limit at the same time on the following day. A second model was used to predict the volume of complaints. Finally, both models were combined into an expected annoyance value: the probability of noise exceeding the threshold level AND causing annoyance (according to tabulated values in the literature) AND causing a complaint. This makes the output very actionable, as it combines several negative consequences into a single value.
The users of this solution could be the local police forces who, knowing that the probability of the noise level exceeding the limit in a given area is high for the next day, can optimise patrols and organise their teams. Because the model also presents the possible cause of a high probability, the police can know in advance what to expect on the spot.
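The expected annoyance value described above can be illustrated with a small sketch. It assumes the three factors can simply be multiplied and that the complaint model's output has been turned into a probability; the function name, the area names and all numbers are hypothetical.

```python
def expected_annoyance(p_exceed, p_annoyed_given_exceed, p_complaint):
    """Probability that noise exceeds the limit AND annoys AND triggers a complaint.

    Assumes each factor is conditional on the previous one (or independent),
    so that the joint probability is their product.
    """
    for p in (p_exceed, p_annoyed_given_exceed, p_complaint):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_exceed * p_annoyed_given_exceed * p_complaint


# Hypothetical neighbourhoods: (P(exceed), P(annoyed | exceed), P(complaint)).
areas = {"A": (0.8, 0.5, 0.25), "B": (0.5, 0.5, 0.5), "C": (0.9, 0.2, 0.1)}

# Rank areas so that patrols can be prioritised where expected annoyance is highest.
ranking = sorted(areas, key=lambda name: expected_annoyance(*areas[name]), reverse=True)
print(ranking)
```

Note that area C has the highest probability of exceeding the limit but ranks last, because the exceedance is unlikely to annoy residents or lead to a complaint. This is the kind of trade-off a single combined value makes visible.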
Conclusion
This article described my participation in the 2021 WDL competition. An insights report with a summary of all the project outcomes is now available for you to check out.