All configs can be set on startup, but some configs, especially those for shuffle, will not work if they are set at runtime. Check the “Applicable at” column to see when each config can be set. “Startup” means the config is valid only on startup; “Runtime” means it is valid on both startup and runtime.
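As an illustration of the startup/runtime distinction, the following is a minimal sketch that sorts a set of desired settings by when they may be applied. The `APPLICABLE_AT` dictionary, `split_by_applicability`, and `to_submit_args` are hypothetical names introduced here and are not part of the plugin; the applicability values are copied from a few rows of the table in this section, and the setting values are arbitrary examples. The sketch assumes startup-only settings are passed through the standard `--conf key=value` arguments of `spark-submit`.

```python
# Applicability values copied from the "Applicable at" column of the
# General Configuration table (only a few rows are listed here).
APPLICABLE_AT = {
    "spark.rapids.memory.gpu.maxAllocFraction": "Startup",
    "spark.rapids.memory.pinnedPool.size": "Startup",
    "spark.rapids.sql.enabled": "Runtime",
    "spark.rapids.sql.explain": "Runtime",
    "spark.rapids.sql.concurrentGpuTasks": "Runtime",
}


def split_by_applicability(settings):
    """Split settings into (startup_only, runtime_ok) dictionaries.

    Keys missing from APPLICABLE_AT are treated as startup-only, which is
    the conservative choice because every config can be set on startup.
    """
    startup_only, runtime_ok = {}, {}
    for key, value in settings.items():
        if APPLICABLE_AT.get(key, "Startup") == "Runtime":
            runtime_ok[key] = value
        else:
            startup_only[key] = value
    return startup_only, runtime_ok


def to_submit_args(startup_only):
    """Render startup-only settings as --conf arguments for spark-submit."""
    args = []
    for key, value in sorted(startup_only.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Example values are arbitrary and chosen only for illustration.
settings = {
    "spark.rapids.memory.pinnedPool.size": "2147483648",
    "spark.rapids.sql.explain": "ALL",
    "spark.rapids.sql.concurrentGpuTasks": "2",
}
startup_only, runtime_ok = split_by_applicability(settings)
print(to_submit_args(startup_only))
print(runtime_ok)
```

In a live PySpark session, each entry in `runtime_ok` could then be applied with `spark.conf.set(key, value)`, while the `startup_only` entries have to be supplied when the application is launched.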
General Configuration
Name
Description
Default Value
Applicable at
spark.rapids.cloudSchemes
Comma-separated list of additional URI schemes that are to be considered cloud-based filesystems. Schemes already included: abfs, abfss, dbfs, gs, s3, s3a, s3n, wasbs, cosn. Cloud-based stores are generally completely separate from the executors and likely have a higher I/O read cost. Cloud filesystems also often get better throughput when multiple readers run in parallel. This is used with spark.rapids.sql.format.parquet.reader.type.
None
Runtime
spark.rapids.filecache.enabled
Controls whether the caching of input files is enabled. When enabled, input data is cached to the same local directories configured for the Spark application. The cache will use up to half the available space by default. To set an absolute cache size limit, see the spark.rapids.filecache.maxBytes configuration setting. Currently only data from Parquet files is cached.
false
Startup
spark.rapids.memory.gpu.maxAllocFraction
The fraction of total GPU memory that limits the maximum size of the RMM pool. The value must be greater than or equal to the setting for spark.rapids.memory.gpu.allocFraction. Note that this limit will be reduced by the reserve memory configured in spark.rapids.memory.gpu.reserve.
1.0
Startup
spark.rapids.memory.gpu.minAllocFraction
The fraction of total GPU memory that limits the minimum size of the RMM pool. The value must be less than or equal to the setting for spark.rapids.memory.gpu.allocFraction.
0.25
Startup
spark.rapids.memory.host.spillStorageSize
Amount of off-heap host memory to use for buffering spilled GPU data before spilling to local disk. Use -1 to set the amount to the combined size of pinned and pageable memory pools.
-1
Startup
spark.rapids.memory.pinnedPool.size
The size of the pinned memory pool in bytes unless otherwise specified. Use 0 to disable the pool.
0
Startup
spark.rapids.sql.batchSizeBytes
Set the target number of bytes for a GPU batch. Split sizes for input data are covered by separate configs. The maximum setting is 2 GB to avoid exceeding the cudf row count limit of a column.
1073741824
Runtime
spark.rapids.sql.concurrentGpuTasks
Set the number of tasks that can execute concurrently per GPU. Tasks may temporarily block when the number of concurrent tasks in the executor exceeds this amount. Allowing too many concurrent tasks on the same GPU may lead to GPU out of memory errors.
2
Runtime
spark.rapids.sql.enabled
Enable (true) or disable (false) SQL operations on the GPU.
true
Runtime
spark.rapids.sql.explain
Explain why some parts of a query were or were not placed on a GPU. Possible values are ALL: print everything, NONE: print nothing, NOT_ON_GPU: print only parts of a query that did not go on the GPU.
NOT_ON_GPU
Runtime
spark.rapids.sql.metrics.level
GPU plans can produce a lot more metrics than CPU plans do. In very large queries this can sometimes result in going over the max result size limit for the driver. Supported values are DEBUG, which enables all supported metrics and typically only needs to be enabled when debugging the plugin; MODERATE, which should output enough metrics to understand how long each part of the query is taking and how much data is going to each part of the query; and ESSENTIAL, which disables most metrics except those that Apache Spark CPU plans also report, or their equivalents.
MODERATE
Runtime
spark.rapids.sql.multiThreadedRead.numThreads
The maximum number of threads on each executor to use for reading small files in parallel. This cannot be changed at runtime after the executor has started. Used with COALESCING and MULTITHREADED readers; see spark.rapids.sql.format.parquet.reader.type, spark.rapids.sql.format.orc.reader.type, or spark.rapids.sql.format.avro.reader.type for a discussion of reader types. If it is not set explicitly and spark.executor.cores is set, the plugin tries to assign the value max(MULTITHREAD_READ_NUM_THREADS_DEFAULT, spark.executor.cores), where MULTITHREAD_READ_NUM_THREADS_DEFAULT = 20.
20
Startup
spark.rapids.sql.reader.batchSizeBytes
Soft limit on the maximum number of bytes the reader reads per batch. The readers will read chunks of data until this limit is met or exceeded. Note that the reader may estimate the number of bytes that will be used on the GPU in some cases based on the schema and number of rows in each batch.
2147483647
Runtime
spark.rapids.sql.reader.batchSizeRows
Soft limit on the maximum number of rows the reader will read per batch. The ORC and Parquet readers will read row groups until this limit is met or exceeded. The limit is respected by the CSV reader.
2147483647
Runtime
spark.rapids.sql.shuffle.spillThreads
Number of threads used to spill shuffle data to disk in the background.
6
Runtime
spark.rapids.sql.udfCompiler.enabled
When set to true, Scala UDFs will be considered for compilation as Catalyst expressions