A recurring question from engineers working on production forecasting and reserves estimation is how to use clustering to group similar production curves when planning for different field development scenarios. Clustering is one of the most important unsupervised learning methods because it lets us discover non-trivial relationships between elements, in this case production curves.
It is important to note that clustering and classification are not the same thing. A clustering method helps us discover classes of production curves from historical production data. A classification method, on the other hand, labels incoming production curves according to how well they match existing classes. Fig. 1 below helps clarify the distinction between clustering and classification. The two are complementary: once production classes have been properly identified with a clustering algorithm, it is useful to relate geological, completion, and drilling data to those classes and predict them via classification techniques.
There is a plethora of commercial and free solutions that can be used to perform clustering. Two of the most common implementations are the K-means and hierarchical clustering methods. However, most off-the-shelf products offer generic clustering algorithms with limited configurability in terms of similarity or distance metrics. This lack of configurability may hinder our ability to obtain clear and consistent cluster separation when tackling real-world scenarios in the Oil and Gas industry, and poor separation often leads to erroneous conclusions.
On the other hand, implementing a more robust custom solution that respects the engineering or physical constraints of a particular problem can be challenging and time-consuming. It usually requires combining subject matter expertise with strong mathematical knowledge in order to derive meaningful solutions.
In this article, we will look at what happens when we naively apply K-means to cluster production curves, and contrast it with an approach that accounts for production features to improve curve alignment. Would the clustering results still hold after several months of production? Let’s find out.
To begin, let us select 12 arbitrary cumulative production curves and try to generate three production groups: Low, Medium, and High. For simplicity and without loss of generality, we assume that both time and cumulative production values have been normalized to the range 0 to 1. We will apply clustering to the raw data using two methods: the traditional K-means clustering method and a modified version that accounts for the physics of the problem.
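The normalization step can be sketched as follows. This is a minimal illustration, not the code used for the figures: the function name `normalize_curves`, the synthetic wells, and the choice of a single global minimum and maximum for all curves (so that Low, Medium, and High producers keep their relative magnitudes) are assumptions.

```python
import numpy as np

def normalize_curves(time, cum_prod):
    """Min-max scale a shared time axis and a (n_wells, n_steps) array to [0, 1]."""
    time = np.asarray(time, dtype=float)
    cum_prod = np.asarray(cum_prod, dtype=float)
    t_norm = (time - time.min()) / (time.max() - time.min())
    # Assumption: one global minimum and maximum for all wells, so that a low
    # producer stays low relative to a high producer after scaling.
    q_min, q_max = cum_prod.min(), cum_prod.max()
    q_norm = (cum_prod - q_min) / (q_max - q_min)
    return t_norm, q_norm

# Made-up example: three wells, months 0 to 24, exponential-type cumulative curves.
months = np.arange(0, 25)
plateaus = np.array([[200.0], [500.0], [1000.0]])  # hypothetical ultimate volumes
cum = plateaus * (1.0 - np.exp(-months / 8.0))
t_norm, q_norm = normalize_curves(months, cum)
```

Scaling each curve by its own maximum instead would push every curve to end at 1 and erase the magnitude differences that the three groups are meant to capture.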
With traditional K-means clustering, we use the standard Euclidean distance, or L2 norm, to characterize the similarity among curves. The physics-based K-means approach accounts for time-varying patterns that typically arise in production: flow regime changes, well shut-ins, formation damage, or mechanical issues that could induce pronounced changes in pressure or rates. To that end, we define suitable similarity indices between production curves using Procrustes procedures, which rely on scaling, translation, and rotation components to compare curve shapes under a common reference system. The proposed approach is related to Dynamic Time Warping (DTW), which is used to compare time series, and it allows us to reduce clustering uncertainty, as we will see next.
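To make the difference between the two similarity measures concrete, here is a minimal sketch of a classical Procrustes disparity between two curves, each stored as an array of (time, cumulative production) points. It is a textbook formulation, not necessarily the similarity index used in this article; the function name `procrustes_disparity` and the synthetic curves are assumptions. Note that the orthogonal fit also admits reflections.

```python
import numpy as np

def procrustes_disparity(curve_a, curve_b):
    """Residual sum of squares after aligning curve_b to curve_a.

    Both inputs are (n_points, 2) arrays of (time, cumulative production).
    """
    a = np.asarray(curve_a, dtype=float)
    b = np.asarray(curve_b, dtype=float)
    # Translation: move both centroids to the origin.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    # Scaling: give both shapes unit Frobenius norm.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    # Rotation: the orthogonal matrix that best maps b onto a (via SVD).
    u, s, vt = np.linalg.svd(b.T @ a)
    rotation = u @ vt
    # With unit-norm shapes, the optimal residual scale is the sum of singular values.
    b_aligned = (b @ rotation) * s.sum()
    return float(np.sum((a - b_aligned) ** 2))

t = np.linspace(0.0, 1.0, 30)
base = np.column_stack([t, 1.0 - np.exp(-3.0 * t)])
shifted = 0.5 * base + np.array([0.1, 0.2])   # same shape, rescaled and translated
other = np.column_stack([t, t ** 3])          # a genuinely different shape

d_same = procrustes_disparity(base, shifted)    # close to zero
d_diff = procrustes_disparity(base, other)      # clearly positive
l2_same = float(np.linalg.norm(base - shifted)) # Euclidean distance is not small
```

The rescaled copy is far from the original in the Euclidean sense, yet its disparity is essentially zero because the three transformations absorb the differences in position and size.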
In Fig. 2 below, we observe the typical outcome when the K-means algorithm is applied with different random initial guesses. There are three different cluster arrangements for the same set of twelve production curves (the upper-right and lower-right plots have the same cluster configuration). The dotted lines indicate the centroids of each cluster, which also vary noticeably across the plots. This variation makes it difficult to characterize a representative Low, Mid, and High production model. Moreover, when the same set of random seeds (i.e., A, B, C, D) is used with a longer production history, there is no way to relate the new cluster configurations depicted in Fig. 3 to those of Fig. 2. Hence, the clustering is unstable.
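The seed sensitivity described above can be checked with a few lines of code. The sketch below is a plain Lloyd-style K-means written for illustration, with a helper that tests whether two runs produce the same grouping once label names are ignored. The synthetic curves, the seeds, and the names `kmeans` and `same_partition` are assumptions and do not reproduce the data behind the figures.

```python
import numpy as np

def kmeans(curves, k, seed, n_iter=100):
    """Plain Lloyd's K-means with Euclidean distance; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(curves, dtype=float)
    # Initial guess: k distinct curves picked at random.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each curve to its nearest centroid (L2 norm).
        dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

def same_partition(labels_a, labels_b):
    """True if two labelings group the curves identically, ignoring label names."""
    pairs = set(zip(labels_a.tolist(), labels_b.tolist()))
    return len(pairs) == len(set(labels_a.tolist())) == len(set(labels_b.tolist()))

# Twelve made-up normalized cumulative curves with no clear gaps between them.
t = np.linspace(0.0, 1.0, 30)
rng = np.random.default_rng(0)
plateau = rng.uniform(0.2, 1.0, size=12)
rate = rng.uniform(1.0, 6.0, size=12)
curves = plateau[:, None] * (1.0 - np.exp(-rate[:, None] * t[None, :]))

runs = {seed: kmeans(curves, k=3, seed=seed)[0] for seed in (1, 2, 3, 4)}
reference = runs[1]
agreement = {seed: same_partition(reference, labels) for seed, labels in runs.items()}
print(agreement)  # whether each seed reproduces the grouping found with seed 1
```

Any `False` entry means that the corresponding initial guess led to a different grouping of the same twelve curves.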
Can we do better within the same K-means framework, and so improve decision making? Yes, indeed.
The goal of the Procrustes approach is to superimpose multiple events in time. With a sequence of scaling, translation, and rotation transformations, we can align events that occur in different portions of the cumulative production curves and make the clustering more resilient. Fig. 4 below shows the results of the physics-based K-means clustering for the same initial guesses used in each of the four cases shown in Fig. 2. We can immediately see that the results are identical regardless of where we start the centroids in the clustering process.
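One way to experiment with this idea is to build a matrix of pairwise Procrustes disparities and cluster on it. The article does not spell out how the Procrustes index is embedded in K-means, so the sketch below swaps in K-medoids, which works directly on a precomputed distance matrix. All function names and the synthetic shape families are assumptions. Keep in mind that full scaling removes amplitude, so a practical index would probably need to retain some measure of magnitude.

```python
import numpy as np

def disparity(a, b):
    """Procrustes disparity between two (n_points, 2) curves, in closed form."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    # After centering and unit scaling, the aligned residual equals
    # 1 - (sum of singular values of b^T a)^2.
    s = np.linalg.svd(b.T @ a, compute_uv=False)
    return max(0.0, 1.0 - s.sum() ** 2)

def pairwise_disparity(curves):
    """Symmetric matrix of disparities between all pairs of curves."""
    n = len(curves)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = disparity(curves[i], curves[j])
    return d

def kmedoids(dist, k, seed, n_iter=50):
    """K-medoids on a precomputed distance matrix (a stand-in for K-means)."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(dist), size=k, replace=False)
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members) > 0:
                # New medoid: the member with the smallest total distance to its cluster.
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids

# Made-up data: three shape families, four uniformly rescaled members each.
t = np.linspace(0.0, 1.0, 40)
curves = []
for rate in (1.0, 4.0, 12.0):
    for m in range(4):
        shape = np.column_stack([t, 1.0 - np.exp(-(rate + 0.1 * m) * t)])
        curves.append((0.4 + 0.2 * m) * shape)  # uniform rescaling of the whole curve

dist = pairwise_disparity(curves)
results = {seed: kmedoids(dist, k=3, seed=seed) for seed in (1, 2, 3, 4)}
```

Comparing the label arrays in `results` across seeds shows whether the grouping depends on the initial guess for a given data set.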
These results give us more confidence in the insights we may derive when assessing future production. As seen in Fig. 5, the clusters remain relatively stable and consistent over time.
We now observe little variation in the overall results. Seeds A and D, as well as B and C, generate very similar clustering patterns. Note that there is no variation in the high-production cluster shown in blue. The variation occurs between the mid (green) and low (red) clusters, where one cumulative curve shows mixed behavior between members of these two clusters. It is a curve that was identified as a low-producing one for 70% of the history in Fig. 4, but then ramped up during the remaining 30% of the production history.
The proposed clustering model yielded fairly convincing results and captured the desired curve features without resorting to a computationally expensive DTW approach. It can also scale far better than DTW because it does not require an optimization procedure to minimize a warping path.
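For context, the cost of DTW is easy to see in a textbook implementation. The sketch below is the classic dynamic program, written for illustration only (the name `dtw_distance` and the toy series are assumptions). It fills a table with one entry per pair of samples, so comparing two curves of lengths n and m takes on the order of n × m operations, and this is repeated for every pair of curves. A Procrustes fit between point-matched two-dimensional curves, by contrast, reduces to the SVD of a 2 × 2 matrix.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic DTW between two 1-D series via an O(len(x) * len(y)) table."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# A series and a copy with one repeated sample warp onto each other at zero cost.
print(dtw_distance([0.0, 1.0, 2.0], [0.0, 0.0, 1.0, 2.0]))  # 0.0
# Two flat series offset by 1 accumulate one unit per matched sample.
print(dtw_distance([0.0, 0.0], [1.0, 1.0]))                  # 2.0
```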
It is important to note that this approach is not restricted to K-means and can be adapted to other clustering models, such as hierarchical, distribution-based, and density-based methods. Lastly, a similar approach to the one illustrated here could be extended to many more Oil & Gas applications where unsupervised assessment may be a viable choice for data sets with limited labeled data.
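As an illustration of that flexibility, any method that accepts a precomputed distance matrix can consume a custom similarity index. The sketch below is a small average-linkage agglomerative clustering written from scratch. The function name `agglomerative` is an assumption, and the distance matrix is a placeholder built from made-up scalar scores, standing in for whatever pairwise index is chosen.

```python
import numpy as np

def agglomerative(dist, k):
    """Average-linkage agglomerative clustering on a precomputed distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance over all cross-cluster pairs.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the two closest clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(dist), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Placeholder distances: absolute differences of made-up scalar scores.
scores = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 20.0])
dist = np.abs(scores[:, None] - scores[None, :])
labels = agglomerative(dist, k=3)
print(labels.tolist())  # [0, 0, 0, 1, 1, 2]
```

Because the merge order depends only on the distance matrix, this kind of method has no random initialization to worry about, although its cost grows quickly with the number of curves.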