What takes time when processing a LAScatalog is not necessarily the computation itself but the time required to read the files. In fact, the read time (i.e. the time needed to load the data into R) may be much longer than the actual computation time. This vignette explains why, and how to speed up the computation by a factor of 2 to 8 by carefully preparing the catalog.
LAScatalog processing

When processing a LAScatalog the area covered is divided into chunks that are then processed sequentially. In the following examples we present the case where chunks are equal to tiles, i.e. when processing each file sequentially. This is the common way to process data but not the only one. In any case, the explanation remains valid even when chunks are not equal to tiles.
So each file is processed sequentially. For a given processed file, the content is read and loaded into R. In the following figure chunk number 42 is currently being read and processed (in blue):
But the file currently being processed is not the only one that needs to be read. To properly process the catalog without edge artifacts we need to load an extra buffer around the processed file (in red).
To load a buffer the processing engine must not only read the processed file but also the 8 neighbouring tiles (in red) to selectively read a small buffer around the processed file. Thus, for each processed file it is not one file that is read but nine. This is one of the reasons why the read time is far from negligible compared to the actual computation time.
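To get a feel for the extra reads, here is a minimal sketch in plain Python (not lidR code). It assumes a hypothetical catalog laid out as a regular grid of tiles, where processing a tile opens the tile itself plus every neighbour that exists. The function name `files_opened` and the 5 x 5 layout are illustrative assumptions.

```python
def files_opened(n_rows, n_cols):
    """Count file reads over a whole grid when each processed tile
    also requires opening its existing neighbouring tiles."""
    total = 0
    for r in range(n_rows):
        for c in range(n_cols):
            # Number of rows/columns covered by the 3 x 3 window, clipped to the grid
            rows = min(r + 1, n_rows - 1) - max(r - 1, 0) + 1
            cols = min(c + 1, n_cols - 1) - max(c - 1, 0) + 1
            total += rows * cols  # the tile itself plus its neighbours
    return total

n_tiles = 5 * 5
reads = files_opened(5, 5)
print(reads, reads / n_tiles)  # 169 reads for 25 tiles, about 6.8 reads per tile
```

Interior tiles need nine reads each, while tiles on the border of the catalog need fewer, so the average is slightly below nine.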
To process a LAScatalog faster we need to read the files faster.
A laz file is a strongly compressed las file. Reading a laz file is thus slower than reading a las file because it must be decompressed on the fly at read time. The following graph shows a benchmark of read time for a single file.
So let’s assume that the total computation time is 1 unit of time, split into 0.25 units of actual processing time and 0.75 units of read time (which is a fairly reasonable ratio). If switching from laz to las divides the read time by 3, we are left with 0.25 units of read time and 0.25 units of computing time, which gives a total of 0.5 instead of 1. We can therefore speed up the computation by a factor of 2 by using the las format instead of laz. The gain is of course less significant for more computationally demanding processes, where reading accounts for a smaller share of the total time.
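The arithmetic above can be written as a small model. This is a sketch in plain Python under the same assumptions as the text (0.75 units of read time, 0.25 units of processing, read time divided by 3). The function name `total_time` and the second, compute-heavy scenario are illustrative additions.

```python
def total_time(read_time, compute_time, read_speedup=1.0):
    """Total run time when only the read phase is accelerated."""
    return read_time / read_speedup + compute_time

# Read-dominated process, as in the text
baseline = total_time(0.75, 0.25)                   # 1.0 unit
with_las = total_time(0.75, 0.25, read_speedup=3)   # 0.5 unit
speedup = baseline / with_las                       # 2.0

# Hypothetical compute-dominated process: same read speed-up, smaller gain
heavy_speedup = total_time(0.25, 0.75) / total_time(0.25, 0.75, read_speedup=3)

print(speedup, round(heavy_speedup, 2))  # 2.0 1.2
```

The second case shows why the benefit shrinks as the computation itself gets heavier: only the read share of the total is reduced.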
So for faster computation users can opt for las files instead of laz files. Obviously, there are good reasons to use laz files instead of las files. The strong compression brought by the laz format has a lot of advantages for storing data. It is up to the user to choose a format by considering the trade-offs between space and computation time. This section explains how it works only to help users make a decision that best suits their needs.
Another way to speed up the total computation time is to avoid reading all 8 neighbouring tiles in full to load a buffer. Instead, we can read only parts of the neighbouring tiles. The gain comes from the fact that we read only a small portion of the neighbouring files to extract the buffer, skipping most of the file contents. Indeed, the buffer usually corresponds to only a very small percentage of the actual contents of a file (equivalent to a few thousand square meters).
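A rough order of magnitude for this can be sketched in plain Python. The numbers are illustrative assumptions: square tiles of 150 m, a 5 m buffer, and a uniform point density so that the share of area approximates the share of points. The function name `buffer_ring` is hypothetical.

```python
def buffer_ring(tile_size, buffer_width):
    """Return the area of the buffer ring around a square tile and the
    fraction of the 8 neighbouring tiles' area that this ring represents."""
    ring_area = (tile_size + 2 * buffer_width) ** 2 - tile_size ** 2
    neighbours_area = 8 * tile_size ** 2
    return ring_area, ring_area / neighbours_area

# Assumed values: 150 m tiles, 5 m buffer
ring_area, fraction = buffer_ring(tile_size=150.0, buffer_width=5.0)
print(ring_area, round(100 * fraction, 1))  # 3100.0 m2, about 1.7 % of the neighbours
```

Under these assumptions, more than 98 % of what is stored in the neighbouring files is not needed to build the buffer.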
This is made possible by indexing the las or laz files with lax files. A lax file is a tiny file associated with a las or laz file that spatially indexes the points to make faster spatial queries. This file type was created by Martin Isenburg in LAStools. For a better understanding of how it works one can refer to a talk given by Martin Isenburg about lasindex.
By adding lax files along with your las/laz files, the buffer can be added around the processed file by only partially reading the 8 neighbouring files (in red).
The best way to create a lax file is to use lasindex from LAStools. It is a free and open-source part of LAStools. If you cannot or do not want to use LAStools, the lidR package has an (undocumented) internal function that creates lax files using the rlas package:
The gain is really significant and allows an additional 2- to 3-fold saving in terms of read time, which significantly speeds up the computation time. Changing from laz to las format has a cost because it implies storing more data. However, using lax files provides a significant gain for free, so there is no incentive not to create lax files.
We demonstrated the importance of decreasing the time taken to read files to improve the overall computation time. The faster you read the files, the faster you perform the computation, because the read time is non-negligible.

When reading a las file, the time taken can be roughly split into 3 equal parts: (1) 1/3 is actual reading, (2) 1/3 is data storage in C++ data structures, (3) 1/3 is copying the data into R objects. Reading only the attributes of interest speeds up steps (2) and (3). The gain is less significant for laz files because step (1) accounts for more than 1/3 of the time.
For fast reading, opting for a fast SSD instead of a slow HDD may significantly speed up the computation, independently of the power of your processor. Hardware matters!
The following are benchmarks for some functions.
A simple point-to-raster canopy height model using 25 files of 150 x 150 m with 30 pts/m² on a laptop with an SSD and an Intel Core i7 processor.
Format | Runtime |
---|---|
laz | 40 sec |
laz + lax | 20 sec |
las | 10 sec |
las + lax | 7 sec |
Here, simply changing the file types reduced the runtime from 40 sec to 7 sec, an almost 6-fold increase in speed.
Computation of a single metric on 360 files of 1 x 1 km with 3 pts/m² (~300 km² and 900 million points) on a laptop with an SSD and an Intel Core i7 processor.
Format | Runtime |
---|---|
laz | 45 min |
las | 15 min |
las + lax | 8 min |
Here we found an almost 6-fold speed-up by changing only the file types.