My projects
Data with a spatial component is ubiquitous today: the location history stored on a mobile device, the number of cars passing traffic counters, and the observations recorded by weather sensors are all examples of data with associated location information. Analysing this type of data can reveal spatial patterns that inform decision making in many areas, such as infrastructure optimisation.
Human mobility data
One fundamental task in analysing location data is computing distances between objects. These objects can be, for example, sequences of GPS locations, called trajectories, which record the latitude and longitude of the places that someone travelled through in a given day. Given a set of such sequences, one problem of interest is to find similar trajectories, that is, trajectories that follow the same path through a road network. In an urban context this has applications such as discovering groups of commuters who move from one part of the city to another in the same time frame. Finding similar trajectories requires computing pairwise distances, but exact computation is expensive because of the number, size, and complexity of the trajectories. We devised new methods of representing trajectories that make distance computation and related tasks more efficient in both time and space. For more details, see the paper Multi-resolution sketches and locality sensitive hashing for fast trajectory processing.
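To give a rough feel for why a compact representation makes comparison cheap, here is a minimal Python sketch. It is an illustration only and not the method from the paper: it assumes each trajectory is reduced to the set of grid cells it passes through, and that two such sets are compared with Jaccard similarity. The function names, the cell size and the coordinates are all hypothetical.

```python
from math import floor


def grid_sketch(trajectory, cell_size):
    """Return the set of grid cells visited by a list of (lat, lon) points.

    cell_size is the side of a square cell in degrees (an assumed resolution).
    """
    return {(floor(lat / cell_size), floor(lon / cell_size))
            for lat, lon in trajectory}


def jaccard(cells_a, cells_b):
    """Jaccard similarity of two cell sets: |intersection| / |union|."""
    if not cells_a and not cells_b:
        return 1.0
    return len(cells_a & cells_b) / len(cells_a | cells_b)


# Made-up trajectories: a and b share the first two cells, c is elsewhere.
traj_a = [(55.9445, -3.1885), (55.9465, -3.1905), (55.9485, -3.1925)]
traj_b = [(55.9447, -3.1887), (55.9467, -3.1907), (55.9505, -3.1945)]
traj_c = [(55.9005, -3.3005), (55.9025, -3.3025)]

CELL = 0.001  # assumed resolution, in degrees
sim_ab = jaccard(grid_sketch(traj_a, CELL), grid_sketch(traj_b, CELL))
sim_ac = jaccard(grid_sketch(traj_a, CELL), grid_sketch(traj_c, CELL))
```

A coarser cell size gives smaller sketches that tolerate more GPS noise, at the cost of treating nearby but distinct routes as identical. Repeating the sketch at several cell sizes is one way to trade accuracy against space.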
Another application of analysing trajectory data is placing transportation resources efficiently. For example, data on cycling trips made in a city can inform how to place cycling lanes efficiently such that a large number of cyclists can use them, while respecting a limited construction budget. At the level of a city it is challenging to identify an optimal set of lanes because of the large number of trips to be analysed and the large number of candidate streets. We proposed a transit network design solution for placing resources under given constraints that can handle large datasets and gives good results in practice.
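The budget-constrained placement problem can be illustrated with a small Python sketch. This is a standard greedy heuristic for budgeted coverage, swapped in for illustration, and is not the solution we proposed: it assumes each candidate street has a construction cost and a set of trips that use it, and it repeatedly picks the affordable street that adds the most newly covered trips per unit of cost. The street names, costs and trip ids are made up.

```python
def greedy_lanes(streets, budget):
    """Pick streets greedily by newly covered trips per unit cost.

    streets: dict mapping street name -> (cost, set of trip ids); costs
             are assumed to be positive.
    budget:  total construction budget.
    Returns (chosen street names, covered trip ids, total cost spent).
    """
    chosen, covered, spent = [], set(), 0.0
    remaining = dict(streets)
    while remaining:
        best, best_ratio = None, 0.0
        for name, (cost, trips) in remaining.items():
            if spent + cost > budget:
                continue  # this street no longer fits in the budget
            ratio = len(trips - covered) / cost
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            break  # nothing affordable adds any new trips
        cost, trips = remaining.pop(best)
        chosen.append(best)
        covered |= trips
        spent += cost
    return chosen, covered, spent


# Hypothetical candidates: name -> (cost, trips that use the street).
streets = {
    "A": (2.0, {1, 2, 3, 4}),
    "B": (1.0, {3}),
    "C": (3.0, {5, 6, 7}),
    "D": (1.0, {8}),
}
chosen, covered, spent = greedy_lanes(streets, budget=4.0)
```

Each round scans every remaining candidate, so this simple version becomes slow with many streets. It also ignores whether the selected lanes form a connected network, which a realistic formulation would likely need to consider.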
Analysing timeseries data for renewable power forecasting
World energy consumption is predicted to increase by 50% by 2050 due to urbanisation and increased standards of living. Given the high levels of greenhouse gas emissions caused by traditional energy sources, demand for sustainable alternatives is on the rise, with renewable sources being the fastest growing. Including renewable sources in the current energy system is difficult because power output forecasting, crucial to the optimal planning and running of renewable power plants, is challenging under variable weather conditions.
Our work focuses on forecasting solar photovoltaic (PV) power production. As one of the most popular sources of renewable energy, solar PV is found in residential, commercial, off-grid and utility-scale domains and accounts for more than 2% of global power consumption.
We proposed a day-ahead PV power prediction model for systems of solar power plants in regions of Hokkaido, Japan. The predictions are based on numerical weather forecasts, clear sky forecasts and a persistence model. The ensemble architecture combines the outputs of a recurrent neural network, a support vector machine model, and a multivariable linear model, which exploit different patterns in the data. The ensemble achieves a lower root mean square error than any of the individual models. We also investigated the influence of spatial factors on the accuracy of the ensemble and individual models. We show that having separate local models for self-contained areas (based on similar weather conditions) gives higher accuracy than a single model that predicts the total output. In addition, models that include the weather forecasts from the neighbouring region perform better than those restricted to the area of prediction. This is particularly interesting given that the effect is observed in both temporal and non-temporal models. Finally, we show that the ensemble model can be further improved by learning the weights of the individual models based on their performance in the training step.
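As an illustration of weighting ensemble members by their training performance, the following Python sketch computes the root mean square error of each model and combines the models with weights proportional to the inverse of that error. The inverse-RMSE rule is an assumption chosen for simplicity and is not necessarily the weighting scheme used in our work. The power values are made up.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def inverse_rmse_weights(y_true, model_preds):
    """Weight each model by 1 / RMSE on the training data, normalised to sum to 1."""
    inv = [1.0 / max(rmse(y_true, preds), 1e-12) for preds in model_preds]
    total = sum(inv)
    return [w / total for w in inv]


def combine(model_preds, weights):
    """Weighted average of the models' predictions, one value per time step."""
    steps = len(model_preds[0])
    return [sum(w * preds[i] for w, preds in zip(weights, model_preds))
            for i in range(steps)]


# Made-up observed PV output and the predictions of two hypothetical models.
observed = [10.0, 20.0, 30.0, 40.0]
model_a = [12.0, 22.0, 28.0, 38.0]  # errors +2, +2, -2, -2
model_b = [6.0, 24.0, 26.0, 44.0]   # errors -4, +4, -4, +4

weights = inverse_rmse_weights(observed, [model_a, model_b])
ensemble = combine([model_a, model_b], weights)
ensemble_rmse = rmse(observed, ensemble)
```

In this toy case the ensemble error falls below that of both members because their errors partly cancel. In practice the weights would be fitted on training data and the comparison made on held-out days.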
Music genre embeddings
Musical genres are inherently ambiguous and difficult to define, and establishing how genres relate to one another is harder still. Yet, genre is perhaps the most common and effective way of describing musical experience. The sheer number of genre labels in use (e.g. Spotify has over 4000 genre tags, LastFM over 500,000 tags) has made manually creating music taxonomies impractical. We propose to use hyperbolic embeddings to learn a general music genre taxonomy by inferring continuous hierarchies directly from the co-occurrence of music genres in a large dataset. We evaluate our learned taxonomy against human expert taxonomies and folksonomies. Our results show that hyperbolic embeddings significantly outperform their Euclidean counterparts (Word2Vec), and also capture hierarchical structure better than various centrality measures in graphs. For more details, see the paper Hyperbolic Embeddings for Music Taxonomy.
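To show why hyperbolic space suits hierarchies, the Python sketch below computes distances in the Poincaré ball, one common model of hyperbolic space. Whether the paper uses this particular model is an assumption here, and the genre names and coordinates are invented for illustration. The distance between points u and v inside the unit ball is arcosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2))).

```python
from math import acosh


def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# Invented 2-D embeddings: a general genre near the origin and two
# specific subgenres close to the boundary of the disc.
rock = (0.0, 0.0)
shoegaze = (0.9, 0.0)
doom_metal = (0.0, 0.9)

d_parent_child = poincare_distance(rock, shoegaze)
d_siblings = poincare_distance(shoegaze, doom_metal)
```

Distances grow quickly towards the boundary, so two subgenres placed near the edge end up much further from each other than either is from the general genre near the origin. This leaves room for a tree-like structure to spread out, which a Euclidean space of the same dimension cannot offer.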