RAPIDS Accelerator for Apache Spark Advanced Configuration

Most users will not need to modify the configuration options listed below. They are documented here for completeness and advanced usage.

The following configuration options are supported by the RAPIDS Accelerator for Apache Spark.

For commonly used configurations and examples of setting options, please refer to the RAPIDS Accelerator for Configuration page.

Advanced Configuration

Name	Description	Default Value	Applicable at
spark.rapids.alluxio.automount.enabled	Enable the feature of auto mounting the cloud storage to Alluxio. It requires the Alluxio master is the same node of Spark driver node. The Alluxio master’s host and port will be read from alluxio.master.hostname and alluxio.master.rpc.port(default: 19998) from ALLUXIO_HOME/conf/alluxio-site.properties, then replace a cloud path which matches spark.rapids.alluxio.bucket.regex like “s3://bar/b.csv” to “alluxio://0.1.2.3:19998/bar/b.csv”, and the bucket “s3://bar” will be mounted to “/bar” in Alluxio automatically.	false	Runtime
spark.rapids.alluxio.bucket.regex	A regex to decide which bucket should be auto-mounted to Alluxio. E.g. when setting as “^s3://bucket.*”, the bucket which starts with “s3://bucket” will be mounted to Alluxio and the path “s3://bucket-foo/a.csv” will be replaced to “alluxio://0.1.2.3:19998/bucket-foo/a.csv”. It’s only valid when setting spark.rapids.alluxio.automount.enabled=true. The default value matches all the buckets in “s3://” or “s3a://” scheme.	^s3a{0,1}://.*	Runtime
spark.rapids.alluxio.home	The Alluxio installation home path or link to the installation home path.	/opt/alluxio	Startup
spark.rapids.alluxio.large.file.threshold	The threshold is used to identify whether average size of files is large when reading from S3. If reading large files from S3 and the disks used by Alluxio are slow, directly reading from S3 is better than reading caches from Alluxio, because S3 network bandwidth is faster than local disk. This improvement takes effect when spark.rapids.alluxio.slow.disk is enabled.	67108864	Runtime
spark.rapids.alluxio.master	The Alluxio master hostname. If not set, read Alluxio master URL from spark.rapids.alluxio.home locally. This config is useful when Alluxio master and Spark driver are not co-located.		Startup
spark.rapids.alluxio.master.port	The Alluxio master port. If not set, read Alluxio master port from spark.rapids.alluxio.home locally. This config is useful when Alluxio master and Spark driver are not co-located.	19998	Startup
spark.rapids.alluxio.pathsToReplace	List of paths to be replaced with corresponding Alluxio scheme. E.g. when configure is set to “s3://foo->alluxio://0.1.2.3:19998/foo,gs://bar->alluxio://0.1.2.3:19998/bar”, it means: “s3://foo/a.csv” will be replaced to “alluxio://0.1.2.3:19998/foo/a.csv” and “gs://bar/b.csv” will be replaced to “alluxio://0.1.2.3:19998/bar/b.csv”. To use this config, you have to mount the buckets to Alluxio by yourself. If you set this config, spark.rapids.alluxio.automount.enabled won’t be valid.	None	Startup
spark.rapids.alluxio.replacement.algo	The algorithm used when replacing the UFS path with the Alluxio path. CONVERT_TIME and TASK_TIME are the valid options. CONVERT_TIME indicates that we do it when we convert it to a GPU file read, this has extra overhead of creating an entirely new file index, which requires listing the files and getting all new file info from Alluxio. TASK_TIME replaces the path as late as possible inside of the task. By waiting and replacing it at task time, it just replaces the path without fetching the file information again, this is faster but doesn’t update locality information if that has a bit impact on performance.	TASK_TIME	Runtime
spark.rapids.alluxio.slow.disk	Indicates whether the disks used by Alluxio are slow. If it’s true and reading S3 large files, Rapids Accelerator reads from S3 directly instead of reading from Alluxio caches. Refer to spark.rapids.alluxio.large.file.threshold which defines a threshold that identifying whether files are large. Typically, it’s slow disks if speed is less than 300M/second. If using convert time spark.rapids.alluxio.replacement.algo, this may not apply to all file types like Delta files	true	Runtime
spark.rapids.alluxio.user	Alluxio user is set on the Alluxio client, which is used to mount or get information. By default it should be the user that running the Alluxio processes. The default value is ubuntu.	ubuntu	Runtime
spark.rapids.filecache.allowPathRegexp	A regular expression to decide which paths will be cached when the file cache is enabled. If this is not set, then all paths are allowed to cache. If a path is allowed by this regexp but blocked by spark.rapids.filecache.blockPathRegexp, then the path is blocked to cache.	None	Startup
spark.rapids.filecache.blockPathRegexp	A regular expression to decide which paths will not be cached when the file cache is enabled. If a path is blocked by this regexp but is allowed by spark.rapids.filecache.allowPathRegexp, then the path is blocked.	None	Startup
spark.rapids.filecache.checkStale	Controls whether the cached is checked for being out of date with respect to the input file. When enabled, the data that has been cached locally for a file will be invalidated if the file is updated after being cached. This feature is only necessary if an input file for a Spark application can be changed during the lifetime of the application. If an individual input file will not be overwritten during the Spark application then performance may be improved by setting this to false.	true	Startup
spark.rapids.filecache.maxBytes	Controls the maximum amount of data that will be cached locally. If left unspecified, it will use half of the available disk space detected on startup for the configured Spark local disks.	None	Startup
spark.rapids.filecache.useChecksums	Whether to write out and verify checksums for the cached local files.	false	Startup
spark.rapids.gpu.resourceName	The name of the Spark resource that represents a GPU that you want the plugin to use if using custom resources with Spark.	gpu	Startup
spark.rapids.memory.gpu.allocFraction	The fraction of available (free) GPU memory that should be allocated for pooled memory. This must be less than or equal to the maximum limit configured via spark.rapids.memory.gpu.maxAllocFraction, and greater than or equal to the minimum limit configured via spark.rapids.memory.gpu.minAllocFraction.	1.0	Startup
spark.rapids.memory.gpu.debug	Provides a log of GPU memory allocations and frees. If set to STDOUT or STDERR the logging will go there. Setting it to NONE disables logging. All other values are reserved for possible future expansion and in the mean time will disable logging.	NONE	Startup
spark.rapids.memory.gpu.oomDumpDir	The path to a local directory where a heap dump will be created if the GPU encounters an unrecoverable out-of-memory (OOM) error. The filename will be of the form: “gpu-oom--.hprof" where is the process ID, and the dumpId is a sequence number to disambiguate multiple heap dumps per process lifecycle	None	Startup
spark.rapids.memory.gpu.pool	Select the RMM pooling allocator to use. Valid values are “DEFAULT”, “ARENA”, “ASYNC”, and “NONE”. With “DEFAULT”, the RMM pool allocator is used; with “ARENA”, the RMM arena allocator is used; with “ASYNC”, the new CUDA stream-ordered memory allocator in CUDA 11.2+ is used. If set to “NONE”, pooling is disabled and RMM just passes through to CUDA memory allocation directly.	ASYNC	Startup
spark.rapids.memory.gpu.pooling.enabled	Should RMM act as a pooling allocator for GPU memory, or should it just pass through to CUDA memory allocation directly. DEPRECATED: please use spark.rapids.memory.gpu.pool instead.	true	Startup
spark.rapids.memory.gpu.reserve	The amount of GPU memory that should remain unallocated by RMM and left for system use such as memory needed for kernels and kernel launches.	671088640	Startup
spark.rapids.memory.gpu.state.debug	To better recover from out of memory errors, RMM will track several states for the threads that interact with the GPU. This provides a log of those state transitions to aid in debugging it. STDOUT or STDERR will have the logging go there empty string will disable logging and anything else will be treated as a file to write the logs to.		Startup
spark.rapids.memory.gpu.unspill.enabled	When a spilled GPU buffer is needed again, should it be unspilled, or only copied back into GPU memory temporarily. Unspilling may be useful for GPU buffers that are needed frequently, for example, broadcast variables; however, it may also increase GPU memory usage	false	Startup
spark.rapids.perfio.s3.enabled	When true, enables an AWS S3 reader for improved performance in certain queries. The presence of AWS SDK packages for Netty and/or CRT HTTP clients on the classpath is required. You can use Spark submit option `--packages software.amazon.awssdk:s3:2.22.12,software.amazon.awssdk:aws-crt-client:2.22.12` to achieve this. See https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/crt-based-s3-client.html#crt-based-s3-client-depend	false	Startup
spark.rapids.python.concurrentPythonWorkers	Set the number of Python worker processes that can execute concurrently per GPU. Python worker processes may temporarily block when the number of concurrent Python worker processes started by the same executor exceeds this amount. Allowing too many concurrent tasks on the same GPU may lead to GPU out of memory errors. >0 means enabled, while <=0 means unlimited	0	Runtime
spark.rapids.python.memory.gpu.allocFraction	The fraction of total GPU memory that should be initially allocated for pooled memory for all the Python workers. It supposes to be less than (1 - $(spark.rapids.memory.gpu.allocFraction)), since the executor will share the GPU with its owning Python workers. Half of the rest will be used if not specified	None	Runtime
spark.rapids.python.memory.gpu.maxAllocFraction	The fraction of total GPU memory that limits the maximum size of the RMM pool for all the Python workers. It supposes to be less than (1 - $(spark.rapids.memory.gpu.maxAllocFraction)), since the executor will share the GPU with its owning Python workers. when setting to 0 it means no limit.	0.0	Runtime
spark.rapids.python.memory.gpu.pooling.enabled	Should RMM in Python workers act as a pooling allocator for GPU memory, or should it just pass through to CUDA memory allocation directly. When not specified, It will honor the value of config ‘spark.rapids.memory.gpu.pooling.enabled’	None	Runtime
spark.rapids.shuffle.enabled	Enable or disable the RAPIDS Shuffle Manager at runtime. The RAPIDS Shuffle Manager must already be configured. When set to `false`, the built-in Spark shuffle will be used.	true	Runtime
spark.rapids.shuffle.mode	RAPIDS Shuffle Manager mode. “MULTITHREADED”: shuffle file writes and reads are parallelized using a thread pool. “UCX”: (requires UCX installation) uses accelerated transports for transferring shuffle blocks. “CACHE_ONLY”: use when running a single executor, for short-circuit cached shuffle (for testing purposes).	MULTITHREADED	Startup
spark.rapids.shuffle.multiThreaded.maxBytesInFlight	The size limit, in bytes, that the RAPIDS shuffle manager configured in “MULTITHREADED” mode will allow to be serialized or deserialized concurrently per task. This is also the maximum amount of memory that will be used per task. This should be set larger than Spark’s default maxBytesInFlight (48MB). The larger this setting is, the more compressed shuffle chunks are processed concurrently. In practice, care needs to be taken to not go over the amount of off-heap memory that Netty has available. See https://github.com/NVIDIA/spark-rapids/issues/9153.	134217728	Startup
spark.rapids.shuffle.multiThreaded.reader.threads	The number of threads to use for reading shuffle blocks per executor in the RAPIDS shuffle manager configured in “MULTITHREADED” mode. There are two special values: 0 = feature is disabled, falls back to Spark built-in shuffle reader; 1 = our implementation of Spark’s built-in shuffle reader with extra metrics.	20	Startup
spark.rapids.shuffle.multiThreaded.writer.threads	The number of threads to use for writing shuffle blocks per executor in the RAPIDS shuffle manager configured in “MULTITHREADED” mode. There are two special values: 0 = feature is disabled, falls back to Spark built-in shuffle writer; 1 = our implementation of Spark’s built-in shuffle writer with extra metrics.	20	Startup
spark.rapids.shuffle.transport.earlyStart	Enable early connection establishment for RAPIDS Shuffle	true	Startup
spark.rapids.shuffle.transport.earlyStart.heartbeatInterval	Shuffle early start heartbeat interval (milliseconds). Executors will send a heartbeat RPC message to the driver at this interval	5000	Startup
spark.rapids.shuffle.transport.earlyStart.heartbeatTimeout	Shuffle early start heartbeat timeout (milliseconds). Executors that don’t heartbeat within this timeout will be considered stale. This timeout must be higher than the value for spark.rapids.shuffle.transport.earlyStart.heartbeatInterval	10000	Startup
spark.rapids.shuffle.transport.maxReceiveInflightBytes	Maximum aggregate amount of bytes that be fetched at any given time from peers during shuffle	1073741824	Startup
spark.rapids.shuffle.ucx.activeMessages.forceRndv	Set to true to force ‘rndv’ mode for all UCX Active Messages. This should only be required with UCX 1.10.x. UCX 1.11.x deployments should set to false.	false	Startup
spark.rapids.shuffle.ucx.managementServerHost	The host to be used to start the management server	null	Startup
spark.rapids.shuffle.ucx.useWakeup	When set to true, use UCX’s event-based progress (epoll) in order to wake up the progress thread when needed, instead of a hot loop.	true	Startup
spark.rapids.sql.allowMultipleJars	Allow multiple rapids-4-spark, spark-rapids-jni, and cudf jars on the classpath. Spark will take the first one it finds, so the version may not be expected. Possisble values are ALWAYS: allow all jars, SAME_REVISION: only allow jars with the same revision, NEVER: do not allow multiple jars at all.	SAME_REVISION	Startup
spark.rapids.sql.castDecimalToFloat.enabled	Casting from decimal to floating point types on the GPU returns results that have tiny difference compared to results returned from CPU.	true	Runtime
spark.rapids.sql.castFloatToDecimal.enabled	Casting from floating point types to decimal on the GPU returns results that have tiny difference compared to results returned from CPU.	true	Runtime
spark.rapids.sql.castFloatToIntegralTypes.enabled	Casting from floating point types to integral types on the GPU supports a slightly different range of values when using Spark 3.1.0 or later. Refer to the CAST documentation for more details.	true	Runtime
spark.rapids.sql.castFloatToString.enabled	Casting from floating point types to string on the GPU returns results that have a different precision than the default results of Spark.	true	Runtime
spark.rapids.sql.castStringToFloat.enabled	When set to true, enables casting from strings to float types (float, double) on the GPU. Currently hex values aren’t supported on the GPU. Also note that casting from string to float types on the GPU returns incorrect results when the string represents any number “1.7976931348623158E308” <= x < “1.7976931348623159E308” and “-1.7976931348623158E308” >= x > “-1.7976931348623159E308” in both these cases the GPU returns Double.MaxValue while CPU returns “+Infinity” and “-Infinity” respectively	true	Runtime
spark.rapids.sql.castStringToTimestamp.enabled	When set to true, casting from string to timestamp is supported on the GPU. The GPU only supports a subset of formats when casting strings to timestamps. Refer to the CAST documentation for more details.	false	Runtime
spark.rapids.sql.coalescing.reader.numFilterParallel	This controls the number of files the coalescing reader will run in each thread when it filters blocks for reading. If this value is greater than zero the files will be filtered in a multithreaded manner where each thread filters the number of files set by this config. If this is set to zero the files are filtered serially. This uses the same thread pool as the multithreaded reader, see spark.rapids.sql.multiThreadedRead.numThreads. Note that filtering multithreaded is useful with Alluxio.	0	Runtime
spark.rapids.sql.concurrentWriterPartitionFlushSize	The flush size of the concurrent writer cache in bytes for each partition. If specified spark.sql.maxConcurrentOutputFileWriters, use concurrent writer to write data. Concurrent writer first caches data for each partition and begins to flush the data if it finds one partition with a size that is greater than or equal to this config. The default value is 0, which will try to select a size based off of file type specific configs. E.g.: It uses `write.parquet.row-group-size-bytes` config for Parquet type and `orc.stripe.size` config for Orc type. If the value is greater than 0, will use this positive value.Max value may get better performance but not always, because concurrent writer uses spillable cache and big value may cause more IO swaps.	0	Runtime
spark.rapids.sql.csv.read.decimal.enabled	CSV reading is not 100% compatible when reading decimals.	false	Runtime
spark.rapids.sql.csv.read.double.enabled	CSV reading is not 100% compatible when reading doubles.	true	Runtime
spark.rapids.sql.csv.read.float.enabled	CSV reading is not 100% compatible when reading floats.	true	Runtime
spark.rapids.sql.decimalOverflowGuarantees	FOR TESTING ONLY. DO NOT USE IN PRODUCTION. Please see the decimal section of the compatibility documents for more information on this config.	true	Runtime
spark.rapids.sql.detectDeltaCheckpointQueries	Queries against Delta Lake _delta_log checkpoint Parquet files are not efficient on the GPU. When this option is enabled, the plugin will attempt to detect these queries and fall back to the CPU.	true	Runtime
spark.rapids.sql.detectDeltaLogQueries	Queries against Delta Lake _delta_log JSON files are not efficient on the GPU. When this option is enabled, the plugin will attempt to detect these queries and fall back to the CPU.	true	Runtime
spark.rapids.sql.fast.sample	Option to turn on fast sample. If enable it is inconsistent with CPU sample because of GPU sample algorithm is inconsistent with CPU.	false	Runtime
spark.rapids.sql.format.avro.enabled	When set to true enables all avro input and output acceleration. (only input is currently supported anyways)	false	Runtime
spark.rapids.sql.format.avro.multiThreadedRead.maxNumFilesParallel	A limit on the maximum number of files per task processed in parallel on the CPU side before the file is sent to the GPU. This affects the amount of host memory used when reading the files in parallel. Used with MULTITHREADED reader, see spark.rapids.sql.format.avro.reader.type.	2147483647	Runtime
spark.rapids.sql.format.avro.multiThreadedRead.numThreads	The maximum number of threads, on one executor, to use for reading small Avro files in parallel. This can not be changed at runtime after the executor has started. Used with MULTITHREADED reader, see spark.rapids.sql.format.avro.reader.type. DEPRECATED: use spark.rapids.sql.multiThreadedRead.numThreads	None	Startup
spark.rapids.sql.format.avro.read.enabled	When set to true enables avro input acceleration	false	Runtime
spark.rapids.sql.format.avro.reader.type	Sets the Avro reader type. We support different types that are optimized for different environments. The original Spark style reader can be selected by setting this to PERFILE which individually reads and copies files to the GPU. Loading many small files individually has high overhead, and using either COALESCING or MULTITHREADED is recommended instead. The COALESCING reader is good when using a local file system where the executors are on the same nodes or close to the nodes the data is being read on. This reader coalesces all the files assigned to a task into a single host buffer before sending it down to the GPU. It copies blocks from a single file into a host buffer in separate threads in parallel, see spark.rapids.sql.multiThreadedRead.numThreads. MULTITHREADED is good for cloud environments where you are reading from a blobstore that is totally separate and likely has a higher I/O read cost. Many times the cloud environments also get better throughput when you have multiple readers in parallel. This reader uses multiple threads to read each file in parallel and each file is sent to the GPU separately. This allows the CPU to keep reading while GPU is also doing work. See spark.rapids.sql.multiThreadedRead.numThreads and spark.rapids.sql.format.avro.multiThreadedRead.maxNumFilesParallel to control the number of threads and amount of memory used. By default this is set to AUTO so we select the reader we think is best. This will either be the COALESCING or the MULTITHREADED based on whether we think the file is in the cloud. See spark.rapids.cloudSchemes.	AUTO	Runtime
spark.rapids.sql.format.csv.enabled	When set to false disables all csv input and output acceleration. (only input is currently supported anyways)	true	Runtime
spark.rapids.sql.format.csv.read.enabled	When set to false disables csv input acceleration	true	Runtime
spark.rapids.sql.format.delta.write.enabled	When set to false disables Delta Lake output acceleration.	true	Runtime
spark.rapids.sql.format.hive.text.enabled	When set to false disables Hive text table acceleration	true	Runtime
spark.rapids.sql.format.hive.text.read.decimal.enabled	Hive text file reading is not 100% compatible when reading decimals. Hive has more limitations on what is valid compared to the GPU implementation in some corner cases. See https://github.com/NVIDIA/spark-rapids/issues/7246	true	Runtime
spark.rapids.sql.format.hive.text.read.double.enabled	Hive text file reading is not 100% compatible when reading doubles.	true	Runtime
spark.rapids.sql.format.hive.text.read.enabled	When set to false disables Hive text table read acceleration	true	Runtime
spark.rapids.sql.format.hive.text.read.float.enabled	Hive text file reading is not 100% compatible when reading floats.	true	Runtime
spark.rapids.sql.format.hive.text.write.enabled	When set to false disables Hive text table write acceleration	false	Runtime
spark.rapids.sql.format.iceberg.enabled	When set to false disables all Iceberg acceleration	true	Runtime
spark.rapids.sql.format.iceberg.read.enabled	When set to false disables Iceberg input acceleration	true	Runtime
spark.rapids.sql.format.json.enabled	When set to true enables all json input and output acceleration. (only input is currently supported anyways)	false	Runtime
spark.rapids.sql.format.json.read.enabled	When set to true enables json input acceleration	false	Runtime
spark.rapids.sql.format.orc.enabled	When set to false disables all orc input and output acceleration	true	Runtime
spark.rapids.sql.format.orc.floatTypesToString.enable	When reading an ORC file, the source data schemas(schemas of ORC file) may differ from the target schemas (schemas of the reader), we need to handle the castings from source type to target type. Since float/double numbers in GPU have different precision with CPU, when casting float/double to string, the result of GPU is different from result of CPU spark. Its default value is `true` (this means the strings result will differ from result of CPU). If it’s set `false` explicitly and there exists casting from float/double to string in the job, then such behavior will cause an exception, and the job will fail.	true	Runtime
spark.rapids.sql.format.orc.multiThreadedRead.maxNumFilesParallel	A limit on the maximum number of files per task processed in parallel on the CPU side before the file is sent to the GPU. This affects the amount of host memory used when reading the files in parallel. Used with MULTITHREADED reader, see spark.rapids.sql.format.orc.reader.type.	2147483647	Runtime
spark.rapids.sql.format.orc.multiThreadedRead.numThreads	The maximum number of threads, on the executor, to use for reading small ORC files in parallel. This can not be changed at runtime after the executor has started. Used with MULTITHREADED reader, see spark.rapids.sql.format.orc.reader.type. DEPRECATED: use spark.rapids.sql.multiThreadedRead.numThreads	None	Startup
spark.rapids.sql.format.orc.read.enabled	When set to false disables orc input acceleration	true	Runtime
spark.rapids.sql.format.orc.reader.type	Sets the ORC reader type. We support different types that are optimized for different environments. The original Spark style reader can be selected by setting this to PERFILE which individually reads and copies files to the GPU. Loading many small files individually has high overhead, and using either COALESCING or MULTITHREADED is recommended instead. The COALESCING reader is good when using a local file system where the executors are on the same nodes or close to the nodes the data is being read on. This reader coalesces all the files assigned to a task into a single host buffer before sending it down to the GPU. It copies blocks from a single file into a host buffer in separate threads in parallel, see spark.rapids.sql.multiThreadedRead.numThreads. MULTITHREADED is good for cloud environments where you are reading from a blobstore that is totally separate and likely has a higher I/O read cost. Many times the cloud environments also get better throughput when you have multiple readers in parallel. This reader uses multiple threads to read each file in parallel and each file is sent to the GPU separately. This allows the CPU to keep reading while GPU is also doing work. See spark.rapids.sql.multiThreadedRead.numThreads and spark.rapids.sql.format.orc.multiThreadedRead.maxNumFilesParallel to control the number of threads and amount of memory used. By default this is set to AUTO so we select the reader we think is best. This will either be the COALESCING or the MULTITHREADED based on whether we think the file is in the cloud. See spark.rapids.cloudSchemes.	AUTO	Runtime
spark.rapids.sql.format.orc.write.enabled	When set to false disables orc output acceleration	true	Runtime
spark.rapids.sql.format.parquet.enabled	When set to false disables all parquet input and output acceleration	true	Runtime
spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel	A limit on the maximum number of files per task processed in parallel on the CPU side before the file is sent to the GPU. This affects the amount of host memory used when reading the files in parallel. Used with MULTITHREADED reader, see spark.rapids.sql.format.parquet.reader.type.	2147483647	Runtime
spark.rapids.sql.format.parquet.multiThreadedRead.numThreads	The maximum number of threads, on the executor, to use for reading small Parquet files in parallel. This can not be changed at runtime after the executor has started. Used with COALESCING and MULTITHREADED reader, see spark.rapids.sql.format.parquet.reader.type. DEPRECATED: use spark.rapids.sql.multiThreadedRead.numThreads	None	Startup
spark.rapids.sql.format.parquet.multithreaded.combine.sizeBytes	The target size in bytes to combine multiple small files together when using the MULTITHREADED parquet reader. With combine disabled, the MULTITHREADED reader reads the files in parallel and sends individual files down to the GPU, but that can be inefficient for small files. When combine is enabled, files that are ready within spark.rapids.sql.format.parquet.multithreaded.combine.waitTime together, up to this threshold size, are combined before sending down to GPU. This can be disabled by setting it to 0. Note that combine also will not go over the spark.rapids.sql.reader.batchSizeRows or spark.rapids.sql.reader.batchSizeBytes limits. DEPRECATED: use spark.rapids.sql.reader.multithreaded.combine.sizeBytes instead.	None	Runtime
spark.rapids.sql.format.parquet.multithreaded.combine.waitTime	When using the multithreaded parquet reader with combine mode, how long to wait, in milliseconds, for more files to finish if haven’t met the size threshold. Note that this will wait this amount of time from when the last file was available, so total wait time could be larger then this. DEPRECATED: use spark.rapids.sql.reader.multithreaded.combine.waitTime instead.	None	Runtime
spark.rapids.sql.format.parquet.multithreaded.read.keepOrder	When using the MULTITHREADED reader, if this is set to true we read the files in the same order Spark does, otherwise the order may not be the same. DEPRECATED: use spark.rapids.sql.reader.multithreaded.read.keepOrder instead.	None	Runtime
spark.rapids.sql.format.parquet.read.enabled	When set to false disables parquet input acceleration	true	Runtime
spark.rapids.sql.format.parquet.reader.footer.type	In some cases reading the footer of the file is very expensive. Typically this happens when there are a large number of columns and relatively few of them are being read on a large number of files. This provides the ability to use a different path to parse and filter the footer. AUTO is the default and decides which path to take using a heuristic. JAVA follows closely with what Apache Spark does. NATIVE will parse and filter the footer using C++.	AUTO	Runtime
spark.rapids.sql.format.parquet.reader.type	Sets the Parquet reader type. We support different types that are optimized for different environments. The original Spark style reader can be selected by setting this to PERFILE which individually reads and copies files to the GPU. Loading many small files individually has high overhead, and using either COALESCING or MULTITHREADED is recommended instead. The COALESCING reader is good when using a local file system where the executors are on the same nodes or close to the nodes the data is being read on. This reader coalesces all the files assigned to a task into a single host buffer before sending it down to the GPU. It copies blocks from a single file into a host buffer in separate threads in parallel, see spark.rapids.sql.multiThreadedRead.numThreads. MULTITHREADED is good for cloud environments where you are reading from a blobstore that is totally separate and likely has a higher I/O read cost. Many times the cloud environments also get better throughput when you have multiple readers in parallel. This reader uses multiple threads to read each file in parallel and each file is sent to the GPU separately. This allows the CPU to keep reading while GPU is also doing work. See spark.rapids.sql.multiThreadedRead.numThreads and spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel to control the number of threads and amount of memory used. By default this is set to AUTO so we select the reader we think is best. This will either be the COALESCING or the MULTITHREADED based on whether we think the file is in the cloud. See spark.rapids.cloudSchemes.	AUTO	Runtime
spark.rapids.sql.format.parquet.write.enabled	When set to false disables parquet output acceleration	true	Runtime
spark.rapids.sql.format.parquet.writer.int96.enabled	When set to false, disables accelerated parquet write if the spark.sql.parquet.outputTimestampType is set to INT96	true	Runtime
spark.rapids.sql.formatNumberFloat.enabled	format_number with floating point types on the GPU returns results that have a different precision than the default results of Spark.	true	Runtime
spark.rapids.sql.hasExtendedYearValues	Spark 3.2.0+ extended parsing of years in dates and timestamps to support the full range of possible values. Prior to this it was limited to a positive 4 digit year. The Accelerator does not support the extended range yet. This config indicates if your data includes this extended range or not, or if you don’t care about getting the correct values on values with the extended range.	true	Runtime
spark.rapids.sql.hashOptimizeSort.enabled	Whether sorts should be inserted after some hashed operations to improve output ordering. This can improve output file sizes when saving to columnar formats.	false	Runtime
spark.rapids.sql.improvedFloatOps.enabled	For some floating point operations spark uses one way to compute the value and the underlying cudf implementation can use an improved algorithm. In some cases this can result in cudf producing an answer when spark overflows.	true	Runtime
spark.rapids.sql.incompatibleDateFormats.enabled	When parsing strings as dates and timestamps in functions like unix_timestamp, some formats are fully supported on the GPU and some are unsupported and will fall back to the CPU. Some formats behave differently on the GPU than the CPU. Spark on the CPU interprets date formats with unsupported trailing characters as nulls, while Spark on the GPU will parse the date with invalid trailing characters. More detail can be found at parsing strings as dates or timestamps.	false	Runtime
spark.rapids.sql.incompatibleOps.enabled	For operations that work, but are not 100% compatible with the Spark equivalent set if they should be enabled by default or disabled by default.	true	Runtime
spark.rapids.sql.join.cross.enabled	When set to true cross joins are enabled on the GPU	true	Runtime
spark.rapids.sql.join.existence.enabled	When set to true existence joins are enabled on the GPU	true	Runtime
spark.rapids.sql.join.fullOuter.enabled	When set to true full outer joins are enabled on the GPU	true	Runtime
spark.rapids.sql.join.inner.enabled	When set to true inner joins are enabled on the GPU	true	Runtime
spark.rapids.sql.join.leftAnti.enabled	When set to true left anti joins are enabled on the GPU	true	Runtime
spark.rapids.sql.join.leftOuter.enabled	When set to true left outer joins are enabled on the GPU	true	Runtime
spark.rapids.sql.join.leftSemi.enabled	When set to true left semi joins are enabled on the GPU	true	Runtime
spark.rapids.sql.join.rightOuter.enabled	When set to true right outer joins are enabled on the GPU	true	Runtime
spark.rapids.sql.json.read.decimal.enabled	When reading a quoted string as a decimal Spark supports reading non-ascii unicode digits, and the RAPIDS Accelerator does not.	true	Runtime
spark.rapids.sql.json.read.double.enabled	JSON reading is not 100% compatible when reading doubles.	true	Runtime
spark.rapids.sql.json.read.float.enabled	JSON reading is not 100% compatible when reading floats.	true	Runtime
spark.rapids.sql.mode	Set the mode for the Rapids Accelerator. The supported modes are explainOnly and executeOnGPU. This config can not be changed at runtime, you must restart the application for it to take affect. The default mode is executeOnGPU, which means the RAPIDS Accelerator plugin convert the Spark operations and execute them on the GPU when possible. The explainOnly mode allows running queries on the CPU and the RAPIDS Accelerator will evaluate the queries as if it was going to run on the GPU. The explanations of what would have run on the GPU and why are output in log messages. When using explainOnly mode, the default explain output is ALL, this can be changed by setting spark.rapids.sql.explain. See that config for more details.	executeongpu	Startup
spark.rapids.sql.optimizer.joinReorder.enabled	When enabled, joins may be reordered for improved query performance	true	Runtime
spark.rapids.sql.python.gpu.enabled	This is an experimental feature and is likely to change in the future. Enable (true) or disable (false) support for scheduling Python Pandas UDFs with GPU resources. When enabled, pandas UDFs are assumed to share the same GPU that the RAPIDs accelerator uses and will honor the python GPU configs	false	Runtime
spark.rapids.sql.reader.chunked	Enable a chunked reader where possible. A chunked reader allows reading highly compressed data that could not be read otherwise, but at the expense of more GPU memory, and in some cases more GPU computation. Currently this only supports ORC and Parquet formats.	true	Runtime
spark.rapids.sql.reader.chunked.limitMemoryUsage	Enable a soft limit on the internal memory usage of the chunked reader (if being used). Such limit is calculated as the multiplication of ‘spark.rapids.sql.batchSizeBytes’ and ‘spark.rapids.sql.reader.chunked.memoryUsageRatio’.For example, if batchSizeBytes is set to 1GB and memoryUsageRatio is 4, the chunked reader will try to keep its memory usage under 4GB.	None	Runtime
spark.rapids.sql.reader.chunked.subPage	Enable a chunked reader where possible for reading data that is smaller than the typical row group/page limit. Currently deprecated and replaced by ‘spark.rapids.sql.reader.chunked.limitMemoryUsage’.	None	Runtime
spark.rapids.sql.reader.multithreaded.combine.sizeBytes	The target size in bytes to combine multiple small files together when using the MULTITHREADED parquet or orc reader. With combine disabled, the MULTITHREADED reader reads the files in parallel and sends individual files down to the GPU, but that can be inefficient for small files. When combine is enabled, files that are ready within spark.rapids.sql.reader.multithreaded.combine.waitTime together, up to this threshold size, are combined before sending down to GPU. This can be disabled by setting it to 0. Note that combine also will not go over the spark.rapids.sql.reader.batchSizeRows or spark.rapids.sql.reader.batchSizeBytes limits.	67108864	Runtime
spark.rapids.sql.reader.multithreaded.combine.waitTime	When using the multithreaded parquet or orc reader with combine mode, how long to wait, in milliseconds, for more files to finish if haven’t met the size threshold. Note that this will wait this amount of time from when the last file was available, so total wait time could be larger then this.	200	Runtime
spark.rapids.sql.reader.multithreaded.read.keepOrder	When using the MULTITHREADED reader, if this is set to true we read the files in the same order Spark does, otherwise the order may not be the same. Now it is supported only for parquet and orc.	true	Runtime
spark.rapids.sql.regexp.enabled	Specifies whether supported regular expressions will be evaluated on the GPU. Unsupported expressions will fall back to CPU. However, there are some known edge cases that will still execute on GPU and produce incorrect results and these are documented in the compatibility guide. Setting this config to false will make all regular expressions run on the CPU instead.	true	Runtime
spark.rapids.sql.regexp.maxStateMemoryBytes	Specifies the maximum memory on GPU to be used for regular expressions.The memory usage is an estimate based on an upper-bound approximation on the complexity of the regular expression. Note that the actual memory usage may still be higher than this estimate depending on the number of rows in the datacolumn and the input strings themselves. It is recommended to not set this to more than 3 times spark.rapids.sql.batchSizeBytes	2147483647	Runtime
spark.rapids.sql.replaceSortMergeJoin.enabled	Allow replacing sortMergeJoin with HashJoin	true	Runtime
spark.rapids.sql.rowBasedUDF.enabled	When set to true, optimizes a row-based UDF in a GPU operation by transferring only the data it needs between GPU and CPU inside a query operation, instead of falling this operation back to CPU. This is an experimental feature, and this config might be removed in the future.	false	Runtime
spark.rapids.sql.stableSort.enabled	Enable or disable stable sorting. Apache Spark’s sorting is typically a stable sort, but sort stability cannot be guaranteed in distributed work loads because the order in which upstream data arrives to a task is not guaranteed. Sort stability then only matters when reading and sorting data from a file using a single task/partition. Because of limitations in the plugin when you enable stable sorting all of the data for a single task will be combined into a single batch before sorting. This currently disables spilling from GPU memory if the data size is too large.	false	Runtime
spark.rapids.sql.suppressPlanningFailure	Option to fallback an individual query to CPU if an unexpected condition prevents the query plan from being converted to a GPU-enabled one. Note this is different from a normal CPU fallback for a yet-to-be-supported Spark SQL feature. If this happens the error should be reported and investigated as a GitHub issue.	false	Runtime
spark.rapids.sql.variableFloatAgg.enabled	Spark assumes that all operations produce the exact same result each time. This is not true for some floating point aggregations, which can produce slightly different results on the GPU as the aggregation is done in parallel. This can enable those operations if you know the query is only computing it once.	true	Runtime
spark.rapids.sql.window.batched.bounded.row.max	Max value for bounded row window preceding/following extents permissible for the window to be evaluated in batched mode. This value affects both the preceding and following bounds, potentially doubling the window size permitted for batched execution	100	Runtime
spark.rapids.sql.window.collectList.enabled	The output size of collect list for a window operation is proportional to the window size squared. The current GPU implementation does not handle this well and is disabled by default. If you know that your window size is very small you can try to enable it.	false	Runtime
spark.rapids.sql.window.collectSet.enabled	The output size of collect set for a window operation can be proportional to the window size squared. The current GPU implementation does not handle this well and is disabled by default. If you know that your window size is very small you can try to enable it.	false	Runtime
spark.rapids.sql.window.range.byte.enabled	When the order-by column of a range based window is byte type and the range boundary calculated for a value has overflow, CPU and GPU will get the different results. When set to false disables the range window acceleration for the byte type order-by column	false	Runtime
spark.rapids.sql.window.range.decimal.enabled	When set to false, this disables the range window acceleration for the DECIMAL type order-by column	true	Runtime
spark.rapids.sql.window.range.double.enabled	When set to false, this disables the range window acceleration for the double type order-by column	true	Runtime
spark.rapids.sql.window.range.float.enabled	When set to false, this disables the range window acceleration for the FLOAT type order-by column	true	Runtime
spark.rapids.sql.window.range.int.enabled	When the order-by column of a range based window is int type and the range boundary calculated for a value has overflow, CPU and GPU will get the different results. When set to false disables the range window acceleration for the int type order-by column	true	Runtime
spark.rapids.sql.window.range.long.enabled	When the order-by column of a range based window is long type and the range boundary calculated for a value has overflow, CPU and GPU will get the different results. When set to false disables the range window acceleration for the long type order-by column	true	Runtime
spark.rapids.sql.window.range.short.enabled	When the order-by column of a range based window is short type and the range boundary calculated for a value has overflow, CPU and GPU will get the different results. When set to false disables the range window acceleration for the short type order-by column	false	Runtime

Supported GPU Operators and Fine Tuning

The RAPIDS Accelerator for Apache Spark can be configured to enable or disable specific GPU accelerated expressions. Enabled expressions are candidates for GPU execution. If the expression is configured as disabled, the accelerator plugin will not attempt replacement, and it will run on the CPU.

Please leverage the spark.rapids.sql.explain setting to get feedback from the plugin as to why parts of a query may not be executing on the GPU.

NOTE: Setting spark.rapids.sql.incompatibleOps.enabled=true will enable all the settings in the table below which are not enabled by default due to incompatibilities.

Expressions

Name	SQL Function(s)	Description	Default Value	Notes
spark.rapids.sql.expression.Abs	`abs`	Absolute value	true	None
spark.rapids.sql.expression.Acos	`acos`	Inverse cosine	true	None
spark.rapids.sql.expression.Acosh	`acosh`	Inverse hyperbolic cosine	true	None
spark.rapids.sql.expression.Add	`+`	Addition	true	None
spark.rapids.sql.expression.Alias		Gives a column a name	true	None
spark.rapids.sql.expression.And	`and`	Logical AND	true	None
spark.rapids.sql.expression.AnsiCast		Convert a column of one type of data into another type	true	None
spark.rapids.sql.expression.ArrayContains	`array_contains`	Returns a boolean if the array contains the passed in key	true	None
spark.rapids.sql.expression.ArrayExcept	`array_except`	Returns an array of the elements in array1 but not in array2, without duplicates	true	This is not 100% compatible with the Spark version because the GPU implementation treats -0.0 and 0.0 as equal, but the CPU implementation currently does not (see SPARK-39845). Also, Apache Spark 3.1.3 fixed issue SPARK-36741 where NaNs in these set like operators were not treated as being equal. We have chosen to break with compatibility for the older versions of Spark in this instance and handle NaNs the same as 3.1.3+
spark.rapids.sql.expression.ArrayExists	`exists`	Return true if any element satisfies the predicate LambdaFunction	true	None
spark.rapids.sql.expression.ArrayFilter	`filter`	Filter an input array using a given predicate	true	None
spark.rapids.sql.expression.ArrayIntersect	`array_intersect`	Returns an array of the elements in the intersection of array1 and array2, without duplicates	true	This is not 100% compatible with the Spark version because the GPU implementation treats -0.0 and 0.0 as equal, but the CPU implementation currently does not (see SPARK-39845). Also, Apache Spark 3.1.3 fixed issue SPARK-36741 where NaNs in these set like operators were not treated as being equal. We have chosen to break with compatibility for the older versions of Spark in this instance and handle NaNs the same as 3.1.3+
spark.rapids.sql.expression.ArrayMax	`array_max`	Returns the maximum value in the array	true	None
spark.rapids.sql.expression.ArrayMin	`array_min`	Returns the minimum value in the array	true	None
spark.rapids.sql.expression.ArrayRemove	`array_remove`	Returns the array after removing all elements that equal to the input element (right) from the input array (left)	true	None
spark.rapids.sql.expression.ArrayRepeat	`array_repeat`	Returns the array containing the given input value (left) count (right) times	true	None
spark.rapids.sql.expression.ArrayTransform	`transform`	Transform elements in an array using the transform function. This is similar to a `map` in functional programming	true	None
spark.rapids.sql.expression.ArrayUnion	`array_union`	Returns an array of the elements in the union of array1 and array2, without duplicates.	true	This is not 100% compatible with the Spark version because the GPU implementation treats -0.0 and 0.0 as equal, but the CPU implementation currently does not (see SPARK-39845). Also, Apache Spark 3.1.3 fixed issue SPARK-36741 where NaNs in these set like operators were not treated as being equal. We have chosen to break with compatibility for the older versions of Spark in this instance and handle NaNs the same as 3.1.3+
spark.rapids.sql.expression.ArraysOverlap	`arrays_overlap`	Returns true if a1 contains at least a non-null element present also in a2. If the arrays have no common element and they are both non-empty and either of them contains a null element null is returned, false otherwise.	true	This is not 100% compatible with the Spark version because the GPU implementation treats -0.0 and 0.0 as equal, but the CPU implementation currently does not (see SPARK-39845). Also, Apache Spark 3.1.3 fixed issue SPARK-36741 where NaNs in these set like operators were not treated as being equal. We have chosen to break with compatibility for the older versions of Spark in this instance and handle NaNs the same as 3.1.3+
spark.rapids.sql.expression.ArraysZip	`arrays_zip`	Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.	true	None
spark.rapids.sql.expression.Ascii	`ascii`	The numeric value of the first character of string data.	false	This is disabled by default because it only supports strings starting with ASCII or Latin-1 characters after Spark 3.2.3, 3.3.1 and 3.4.0. Otherwise the results will not match the CPU.
spark.rapids.sql.expression.Asin	`asin`	Inverse sine	true	None
spark.rapids.sql.expression.Asinh	`asinh`	Inverse hyperbolic sine	true	None
spark.rapids.sql.expression.AtLeastNNonNulls		Checks if number of non null/Nan values is greater than a given value	true	None
spark.rapids.sql.expression.Atan	`atan`	Inverse tangent	true	None
spark.rapids.sql.expression.Atanh	`atanh`	Inverse hyperbolic tangent	true	None
spark.rapids.sql.expression.AttributeReference		References an input column	true	None
spark.rapids.sql.expression.BRound	`bround`	Round an expression to d decimal places using HALF_EVEN rounding mode	true	None
spark.rapids.sql.expression.BitLength	`bit_length`	The bit length of string data	true	None
spark.rapids.sql.expression.BitwiseAnd	`&`	Returns the bitwise AND of the operands	true	None
spark.rapids.sql.expression.BitwiseNot	`~`	Returns the bitwise NOT of the operands	true	None
spark.rapids.sql.expression.BitwiseOr	`\\|`	Returns the bitwise OR of the operands	true	None
spark.rapids.sql.expression.BitwiseXor	`^`	Returns the bitwise XOR of the operands	true	None
spark.rapids.sql.expression.BoundReference		Reference to a bound variable	true	None
spark.rapids.sql.expression.CaseWhen	`when`	CASE WHEN expression	true	None
spark.rapids.sql.expression.Cast	`bigint`, `binary`, `boolean`, `cast`, `date`, `decimal`, `double`, `float`, `int`, `smallint`, `string`, `timestamp`, `tinyint`	Convert a column of one type of data into another type	true	None
spark.rapids.sql.expression.Cbrt	`cbrt`	Cube root	true	None
spark.rapids.sql.expression.Ceil	`ceil`, `ceiling`	Ceiling of a number	true	None
spark.rapids.sql.expression.CheckOverflow		CheckOverflow after arithmetic operations between DecimalType data	true	None
spark.rapids.sql.expression.Coalesce	`coalesce`	Returns the first non-null argument if exists. Otherwise, null	true	None
spark.rapids.sql.expression.Concat	`concat`	List/String concatenate	true	None
spark.rapids.sql.expression.ConcatWs	`concat_ws`	Concatenates multiple input strings or array of strings into a single string using a given separator	true	None
spark.rapids.sql.expression.Contains		Contains	true	None
spark.rapids.sql.expression.Conv	`conv`	Convert string representing a number from one base to another	false	This is disabled by default because GPU implementation is incomplete. We currently only support from/to_base values of 10 and 16. We fall back on CPU if the signed conversion is signalled via a negative to_base. GPU implementation does not check for an 64-bit signed/unsigned int overflow when performing the conversion to return `FFFFFFFFFFFFFFFF` or `18446744073709551615` or to throw an error in the ANSI mode. It is safe to enable if the overflow is not possible or detected externally. For instance decimal strings not longer than 18 characters / hexadecimal strings not longer than 15 characters disregarding the sign cannot cause an overflow.
spark.rapids.sql.expression.Cos	`cos`	Cosine	true	None
spark.rapids.sql.expression.Cosh	`cosh`	Hyperbolic cosine	true	None
spark.rapids.sql.expression.Cot	`cot`	Cotangent	true	None
spark.rapids.sql.expression.CreateArray	`array`	Returns an array with the given elements	true	None
spark.rapids.sql.expression.CreateMap	`map`	Create a map	true	None
spark.rapids.sql.expression.CreateNamedStruct	`named_struct`, `struct`	Creates a struct with the given field names and values	true	None
spark.rapids.sql.expression.CurrentRow$		Special boundary for a window frame, indicating stopping at the current row	true	None
spark.rapids.sql.expression.DateAdd	`date_add`	Returns the date that is num_days after start_date	true	None
spark.rapids.sql.expression.DateAddInterval		Adds interval to date	true	None
spark.rapids.sql.expression.DateDiff	`datediff`	Returns the number of days from startDate to endDate	true	None
spark.rapids.sql.expression.DateFormatClass	`date_format`	Converts timestamp to a value of string in the format specified by the date format	true	None
spark.rapids.sql.expression.DateSub	`date_sub`	Returns the date that is num_days before start_date	true	None
spark.rapids.sql.expression.DayOfMonth	`day`, `dayofmonth`	Returns the day of the month from a date or timestamp	true	None
spark.rapids.sql.expression.DayOfWeek	`dayofweek`	Returns the day of the week (1 = Sunday…7=Saturday)	true	None
spark.rapids.sql.expression.DayOfYear	`dayofyear`	Returns the day of the year from a date or timestamp	true	None
spark.rapids.sql.expression.DenseRank	`dense_rank`	Window function that returns the dense rank value within the aggregation window	true	None
spark.rapids.sql.expression.Divide	`/`	Division	true	None
spark.rapids.sql.expression.DynamicPruningExpression		Dynamic pruning expression marker	true	None
spark.rapids.sql.expression.ElementAt	`element_at`	Returns element of array at given(1-based) index in value if column is array. Returns value for the given key in value if column is map.	true	None
spark.rapids.sql.expression.EndsWith		Ends with	true	None
spark.rapids.sql.expression.EqualNullSafe	`<=>`	Check if the values are equal including nulls <=>	true	None
spark.rapids.sql.expression.EqualTo	`==`, `=`	Check if the values are equal	true	None
spark.rapids.sql.expression.Exp	`exp`	Euler’s number e raised to a power	true	None
spark.rapids.sql.expression.Explode	`explode_outer`, `explode`	Given an input array produces a sequence of rows for each value in the array	true	None
spark.rapids.sql.expression.Expm1	`expm1`	Euler’s number e raised to a power minus 1	true	None
spark.rapids.sql.expression.Flatten	`flatten`	Creates a single array from an array of arrays	true	None
spark.rapids.sql.expression.Floor	`floor`	Floor of a number	true	None
spark.rapids.sql.expression.FormatNumber	`format_number`	Formats the number x like ‘#,###,###.##’, rounded to d decimal places.	true	None
spark.rapids.sql.expression.FromUTCTimestamp	`from_utc_timestamp`	Render the input UTC timestamp in the input timezone	true	None
spark.rapids.sql.expression.FromUnixTime	`from_unixtime`	Get the string from a unix timestamp	true	None
spark.rapids.sql.expression.GetArrayItem		Gets the field at `ordinal` in the Array	true	None
spark.rapids.sql.expression.GetArrayStructFields		Extracts the `ordinal`-th fields of all array elements for the data with the type of array of struct	true	None
spark.rapids.sql.expression.GetJsonObject	`get_json_object`	Extracts a json object from path	false	This is disabled by default because Experimental feature that could be unstable or have performance issues.
spark.rapids.sql.expression.GetMapValue		Gets Value from a Map based on a key	true	None
spark.rapids.sql.expression.GetStructField		Gets the named field of the struct	true	None
spark.rapids.sql.expression.GetTimestamp		Gets timestamps from strings using given pattern.	true	None
spark.rapids.sql.expression.GreaterThan	`>`	> operator	true	None
spark.rapids.sql.expression.GreaterThanOrEqual	`>=`	>= operator	true	None
spark.rapids.sql.expression.Greatest	`greatest`	Returns the greatest value of all parameters, skipping null values	true	None
spark.rapids.sql.expression.Hour	`hour`	Returns the hour component of the string/timestamp	true	None
spark.rapids.sql.expression.Hypot	`hypot`	Pythagorean addition (Hypotenuse) of real numbers	true	None
spark.rapids.sql.expression.If	`if`	IF expression	true	None
spark.rapids.sql.expression.In	`in`	IN operator	true	None
spark.rapids.sql.expression.InSet		INSET operator	true	None
spark.rapids.sql.expression.InitCap	`initcap`	Returns str with the first letter of each word in uppercase. All other letters are in lowercase	true	This is not 100% compatible with the Spark version because the Unicode version used by cuDF and the JVM may differ, resulting in some corner-case characters not changing case correctly.
spark.rapids.sql.expression.InputFileBlockLength	`input_file_block_length`	Returns the length of the block being read, or -1 if not available	true	None
spark.rapids.sql.expression.InputFileBlockStart	`input_file_block_start`	Returns the start offset of the block being read, or -1 if not available	true	None
spark.rapids.sql.expression.InputFileName	`input_file_name`	Returns the name of the file being read, or empty string if not available	true	None
spark.rapids.sql.expression.IntegralDivide	`div`	Division with a integer result	true	None
spark.rapids.sql.expression.IsNaN	`isnan`	Checks if a value is NaN	true	None
spark.rapids.sql.expression.IsNotNull	`isnotnull`	Checks if a value is not null	true	None
spark.rapids.sql.expression.IsNull	`isnull`	Checks if a value is null	true	None
spark.rapids.sql.expression.JsonToStructs	`from_json`	Returns a struct value with the given `jsonStr` and `schema`	false	This is disabled by default because it is currently in beta and undergoes continuous enhancements. Please consult the compatibility documentation to determine whether you can enable this configuration for your use case
spark.rapids.sql.expression.JsonTuple	`json_tuple`	Returns a tuple like the function get_json_object, but it takes multiple names. All the input parameters and output column types are string.	false	This is disabled by default because Experimental feature that could be unstable or have performance issues.
spark.rapids.sql.expression.KnownFloatingPointNormalized		Tag to prevent redundant normalization	true	None
spark.rapids.sql.expression.KnownNotNull		Tag an expression as known to not be null	true	None
spark.rapids.sql.expression.Lag	`lag`	Window function that returns N entries behind this one	true	None
spark.rapids.sql.expression.LambdaFunction		Holds a higher order SQL function	true	None
spark.rapids.sql.expression.LastDay	`last_day`	Returns the last day of the month which the date belongs to	true	None
spark.rapids.sql.expression.Lead	`lead`	Window function that returns N entries ahead of this one	true	None
spark.rapids.sql.expression.Least	`least`	Returns the least value of all parameters, skipping null values	true	None
spark.rapids.sql.expression.Length	`char_length`, `character_length`, `length`	String character length or binary byte length	true	None
spark.rapids.sql.expression.LessThan	`<`	< operator	true	None
spark.rapids.sql.expression.LessThanOrEqual	`<=`	<= operator	true	None
spark.rapids.sql.expression.Like	`like`	Like	true	None
spark.rapids.sql.expression.Literal		Holds a static value from the query	true	None
spark.rapids.sql.expression.Log	`ln`	Natural log	true	None
spark.rapids.sql.expression.Log10	`log10`	Log base 10	true	None
spark.rapids.sql.expression.Log1p	`log1p`	Natural log 1 + expr	true	None
spark.rapids.sql.expression.Log2	`log2`	Log base 2	true	None
spark.rapids.sql.expression.Logarithm	`log`	Log variable base	true	None
spark.rapids.sql.expression.Lower	`lcase`, `lower`	String lowercase operator	true	This is not 100% compatible with the Spark version because the Unicode version used by cuDF and the JVM may differ, resulting in some corner-case characters not changing case correctly.
spark.rapids.sql.expression.MakeDecimal		Create a Decimal from an unscaled long value for some aggregation optimizations	true	None
spark.rapids.sql.expression.MapConcat	`map_concat`	Returns the union of all the given maps	true	None
spark.rapids.sql.expression.MapEntries	`map_entries`	Returns an unordered array of all entries in the given map	true	None
spark.rapids.sql.expression.MapFilter	`map_filter`	Filters entries in a map using the function	true	None
spark.rapids.sql.expression.MapKeys	`map_keys`	Returns an unordered array containing the keys of the map	true	None
spark.rapids.sql.expression.MapValues	`map_values`	Returns an unordered array containing the values of the map	true	None
spark.rapids.sql.expression.Md5	`md5`	MD5 hash operator	true	None
spark.rapids.sql.expression.MicrosToTimestamp	`timestamp_micros`	Converts the number of microseconds from unix epoch to a timestamp	true	None
spark.rapids.sql.expression.MillisToTimestamp	`timestamp_millis`	Converts the number of milliseconds from unix epoch to a timestamp	true	None
spark.rapids.sql.expression.Minute	`minute`	Returns the minute component of the string/timestamp	true	None
spark.rapids.sql.expression.MonotonicallyIncreasingID	`monotonically_increasing_id`	Returns monotonically increasing 64-bit integers	true	None
spark.rapids.sql.expression.Month	`month`	Returns the month from a date or timestamp	true	None
spark.rapids.sql.expression.Multiply	`*`	Multiplication	true	None
spark.rapids.sql.expression.Murmur3Hash	`hash`	Murmur3 hash operator	true	None
spark.rapids.sql.expression.NaNvl	`nanvl`	Evaluates to `left` iff left is not NaN, `right` otherwise	true	None
spark.rapids.sql.expression.NamedLambdaVariable		A parameter to a higher order SQL function	true	None
spark.rapids.sql.expression.Not	`!`, `not`	Boolean not operator	true	None
spark.rapids.sql.expression.NthValue	`nth_value`	nth window operator	true	None
spark.rapids.sql.expression.OctetLength	`octet_length`	The byte length of string data	true	None
spark.rapids.sql.expression.Or	`or`	Logical OR	true	None
spark.rapids.sql.expression.ParseUrl	`parse_url`	Extracts a part from a URL	true	None
spark.rapids.sql.expression.PercentRank	`percent_rank`	Window function that returns the percent rank value within the aggregation window	true	None
spark.rapids.sql.expression.Pmod	`pmod`	Pmod	true	None
spark.rapids.sql.expression.PosExplode	`posexplode_outer`, `posexplode`	Given an input array produces a sequence of rows for each value in the array	true	None
spark.rapids.sql.expression.Pow	`pow`, `power`	lhs ^ rhs	true	None
spark.rapids.sql.expression.PreciseTimestampConversion		Expression used internally to convert the TimestampType to Long and back without losing precision, i.e. in microseconds. Used in time windowing	true	None
spark.rapids.sql.expression.PromotePrecision		PromotePrecision before arithmetic operations between DecimalType data	true	None
spark.rapids.sql.expression.PythonUDF		UDF run in an external python process. Does not actually run on the GPU, but the transfer of data to/from it can be accelerated	true	None
spark.rapids.sql.expression.Quarter	`quarter`	Returns the quarter of the year for date, in the range 1 to 4	true	None
spark.rapids.sql.expression.RLike	`rlike`	Regular expression version of Like	true	None
spark.rapids.sql.expression.RaiseError	`raise_error`	Throw an exception	true	None
spark.rapids.sql.expression.Rand	`rand`, `random`	Generate a random column with i.i.d. uniformly distributed values in [0, 1)	true	None
spark.rapids.sql.expression.Rank	`rank`	Window function that returns the rank value within the aggregation window	true	None
spark.rapids.sql.expression.RegExpExtract	`regexp_extract`	Extract a specific group identified by a regular expression	true	None
spark.rapids.sql.expression.RegExpExtractAll	`regexp_extract_all`	Extract all strings matching a regular expression corresponding to the regex group index	true	None
spark.rapids.sql.expression.RegExpReplace	`regexp_replace`	String replace using a regular expression pattern	true	None
spark.rapids.sql.expression.Remainder	`%`, `mod`	Remainder or modulo	true	None
spark.rapids.sql.expression.ReplicateRows		Given an input row replicates the row N times	true	None
spark.rapids.sql.expression.Reverse	`reverse`	Returns a reversed string or an array with reverse order of elements	true	None
spark.rapids.sql.expression.Rint	`rint`	Rounds up a double value to the nearest double equal to an integer	true	None
spark.rapids.sql.expression.Round	`round`	Round an expression to d decimal places using HALF_UP rounding mode	true	None
spark.rapids.sql.expression.RowNumber	`row_number`	Window function that returns the index for the row within the aggregation window	true	None
spark.rapids.sql.expression.ScalaUDF		User Defined Function, the UDF can choose to implement a RAPIDS accelerated interface to get better performance.	true	None
spark.rapids.sql.expression.Second	`second`	Returns the second component of the string/timestamp	true	None
spark.rapids.sql.expression.SecondsToTimestamp	`timestamp_seconds`	Converts the number of seconds from unix epoch to a timestamp	true	None
spark.rapids.sql.expression.Sequence	`sequence`	Sequence	true	None
spark.rapids.sql.expression.ShiftLeft	`shiftleft`	Bitwise shift left («)	true	None
spark.rapids.sql.expression.ShiftRight	`shiftright`	Bitwise shift right (»)	true	None
spark.rapids.sql.expression.ShiftRightUnsigned	`shiftrightunsigned`	Bitwise unsigned shift right (»>)	true	None
spark.rapids.sql.expression.Signum	`sign`, `signum`	Returns -1.0, 0.0 or 1.0 as expr is negative, 0 or positive	true	None
spark.rapids.sql.expression.Sin	`sin`	Sine	true	None
spark.rapids.sql.expression.Sinh	`sinh`	Hyperbolic sine	true	None
spark.rapids.sql.expression.Size	`cardinality`, `size`	The size of an array or a map	true	None
spark.rapids.sql.expression.SortArray	`sort_array`	Returns a sorted array with the input array and the ascending / descending order	true	None
spark.rapids.sql.expression.SortOrder		Sort order	true	None
spark.rapids.sql.expression.SparkPartitionID	`spark_partition_id`	Returns the current partition id	true	None
spark.rapids.sql.expression.SpecifiedWindowFrame		Specification of the width of the group (or “frame”) of input rows around which a window function is evaluated	true	None
spark.rapids.sql.expression.Sqrt	`sqrt`	Square root	true	None
spark.rapids.sql.expression.Stack	`stack`	Separates expr1, …, exprk into n rows.	true	None
spark.rapids.sql.expression.StartsWith		Starts with	true	None
spark.rapids.sql.expression.StringInstr	`instr`	Instr string operator	true	None
spark.rapids.sql.expression.StringLPad	`lpad`	Pad a string on the left	true	None
spark.rapids.sql.expression.StringLocate	`locate`, `position`	Substring search operator	true	None
spark.rapids.sql.expression.StringRPad	`rpad`	Pad a string on the right	true	None
spark.rapids.sql.expression.StringRepeat	`repeat`	StringRepeat operator that repeats the given strings with numbers of times given by repeatTimes	true	None
spark.rapids.sql.expression.StringReplace	`replace`	StringReplace operator	true	None
spark.rapids.sql.expression.StringSplit	`split`	Splits `str` around occurrences that match `regex`	true	None
spark.rapids.sql.expression.StringToMap	`str_to_map`	Creates a map after splitting the input string into pairs of key-value strings	true	None
spark.rapids.sql.expression.StringTranslate	`translate`	StringTranslate operator	true	This is not 100% compatible with the Spark version because the GPU implementation supports all unicode code points. In Spark versions < 3.2.0, translate() does not support unicode characters with code point >= U+10000 (See SPARK-34094)
spark.rapids.sql.expression.StringTrim	`trim`	StringTrim operator	true	None
spark.rapids.sql.expression.StringTrimLeft	`ltrim`	StringTrimLeft operator	true	None
spark.rapids.sql.expression.StringTrimRight	`rtrim`	StringTrimRight operator	true	None
spark.rapids.sql.expression.StructsToJson	`to_json`	Converts structs to JSON text format	false	This is disabled by default because it is currently in beta and undergoes continuous enhancements. Please consult the compatibility documentation to determine whether you can enable this configuration for your use case
spark.rapids.sql.expression.Substring	`substr`, `substring`	Substring operator	true	None
spark.rapids.sql.expression.SubstringIndex	`substring_index`	substring_index operator	true	None
spark.rapids.sql.expression.Subtract	`-`	Subtraction	true	None
spark.rapids.sql.expression.Tan	`tan`	Tangent	true	None
spark.rapids.sql.expression.Tanh	`tanh`	Hyperbolic tangent	true	None
spark.rapids.sql.expression.TimeAdd		Adds interval to timestamp	true	None
spark.rapids.sql.expression.ToDegrees	`degrees`	Converts radians to degrees	true	None
spark.rapids.sql.expression.ToRadians	`radians`	Converts degrees to radians	true	None
spark.rapids.sql.expression.ToUTCTimestamp	`to_utc_timestamp`	Render the input timestamp in UTC	true	None
spark.rapids.sql.expression.ToUnixTimestamp	`to_unix_timestamp`	Returns the UNIX timestamp of the given time	true	None
spark.rapids.sql.expression.TransformKeys	`transform_keys`	Transform keys in a map using a transform function	true	None
spark.rapids.sql.expression.TransformValues	`transform_values`	Transform values in a map using a transform function	true	None
spark.rapids.sql.expression.UnaryMinus	`negative`	Negate a numeric value	true	None
spark.rapids.sql.expression.UnaryPositive	`positive`	A numeric value with a + in front of it	true	None
spark.rapids.sql.expression.UnboundedFollowing$		Special boundary for a window frame, indicating all rows preceding the current row	true	None
spark.rapids.sql.expression.UnboundedPreceding$		Special boundary for a window frame, indicating all rows preceding the current row	true	None
spark.rapids.sql.expression.UnixTimestamp	`unix_timestamp`	Returns the UNIX timestamp of current or specified time	true	None
spark.rapids.sql.expression.UnscaledValue		Convert a Decimal to an unscaled long value for some aggregation optimizations	true	None
spark.rapids.sql.expression.Upper	`ucase`, `upper`	String uppercase operator	true	This is not 100% compatible with the Spark version because the Unicode version used by cuDF and the JVM may differ, resulting in some corner-case characters not changing case correctly.
spark.rapids.sql.expression.WeekDay	`weekday`	Returns the day of the week (0 = Monday…6=Sunday)	true	None
spark.rapids.sql.expression.WindowExpression		Calculates a return value for every input row of a table based on a group (or “window”) of rows	true	None
spark.rapids.sql.expression.WindowSpecDefinition		Specification of a window function, indicating the partitioning-expression, the row ordering, and the width of the window	true	None
spark.rapids.sql.expression.XxHash64	`xxhash64`	xxhash64 hash operator	true	None
spark.rapids.sql.expression.Year	`year`	Returns the year from a date or timestamp	true	None
spark.rapids.sql.expression.AggregateExpression		Aggregate expression	true	None
spark.rapids.sql.expression.ApproximatePercentile	`approx_percentile`, `percentile_approx`	Approximate percentile	true	This is not 100% compatible with the Spark version because the GPU implementation of approx_percentile is not bit-for-bit compatible with Apache Spark
spark.rapids.sql.expression.Average	`avg`, `mean`	Average aggregate operator	true	None
spark.rapids.sql.expression.CollectList	`collect_list`	Collect a list of non-unique elements, not supported in reduction	true	None
spark.rapids.sql.expression.CollectSet	`collect_set`	Collect a set of unique elements, not supported in reduction	true	None
spark.rapids.sql.expression.Count	`count`	Count aggregate operator	true	None
spark.rapids.sql.expression.First	`first_value`, `first`	first aggregate operator	true	None
spark.rapids.sql.expression.Last	`last_value`, `last`	last aggregate operator	true	None
spark.rapids.sql.expression.Max	`max`	Max aggregate operator	true	None
spark.rapids.sql.expression.Min	`min`	Min aggregate operator	true	None
spark.rapids.sql.expression.Percentile	`percentile`	Aggregation computing exact percentile	true	None
spark.rapids.sql.expression.PivotFirst		PivotFirst operator	true	None
spark.rapids.sql.expression.StddevPop	`stddev_pop`	Aggregation computing population standard deviation	true	None
spark.rapids.sql.expression.StddevSamp	`std`, `stddev_samp`, `stddev`	Aggregation computing sample standard deviation	true	None
spark.rapids.sql.expression.Sum	`sum`	Sum aggregate operator	true	None
spark.rapids.sql.expression.VariancePop	`var_pop`	Aggregation computing population variance	true	None
spark.rapids.sql.expression.VarianceSamp	`var_samp`, `variance`	Aggregation computing sample variance	true	None
spark.rapids.sql.expression.NormalizeNaNAndZero		Normalize NaN and zero	true	None
spark.rapids.sql.expression.ScalarSubquery		Subquery that will return only one row and one column	true	None
spark.rapids.sql.expression.HiveGenericUDF		Hive Generic UDF, the UDF can choose to implement a RAPIDS accelerated interface to get better performance	true	None
spark.rapids.sql.expression.HiveSimpleUDF		Hive UDF, the UDF can choose to implement a RAPIDS accelerated interface to get better performance	true	None

Execution

Name	Description	Default Value	Notes
spark.rapids.sql.exec.CoalesceExec	The backend for the dataframe coalesce method	true	None
spark.rapids.sql.exec.CollectLimitExec	Reduce to single partition and apply limit	false	This is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU
spark.rapids.sql.exec.ExpandExec	The backend for the expand operator	true	None
spark.rapids.sql.exec.FileSourceScanExec	Reading data from files, often from Hive tables	true	None
spark.rapids.sql.exec.FilterExec	The backend for most filter statements	true	None
spark.rapids.sql.exec.GenerateExec	The backend for operations that generate more output rows than input rows like explode	true	None
spark.rapids.sql.exec.GlobalLimitExec	Limiting of results across partitions	true	None
spark.rapids.sql.exec.LocalLimitExec	Per-partition limiting of results	true	None
spark.rapids.sql.exec.ProjectExec	The backend for most select, withColumn and dropColumn statements	true	None
spark.rapids.sql.exec.RangeExec	The backend for range operator	true	None
spark.rapids.sql.exec.SampleExec	The backend for the sample operator	true	None
spark.rapids.sql.exec.SortExec	The backend for the sort operator	true	None
spark.rapids.sql.exec.SubqueryBroadcastExec	Plan to collect and transform the broadcast key values	true	None
spark.rapids.sql.exec.TakeOrderedAndProjectExec	Take the first limit elements as defined by the sortOrder, and do projection if needed	true	None
spark.rapids.sql.exec.UnionExec	The backend for the union operator	true	None
spark.rapids.sql.exec.CustomShuffleReaderExec	A wrapper of shuffle query stage	true	None
spark.rapids.sql.exec.HashAggregateExec	The backend for hash based aggregations	true	None
spark.rapids.sql.exec.ObjectHashAggregateExec	The backend for hash based aggregations supporting TypedImperativeAggregate functions	true	None
spark.rapids.sql.exec.SortAggregateExec	The backend for sort based aggregations	true	None
spark.rapids.sql.exec.InMemoryTableScanExec	Implementation of InMemoryTableScanExec to use GPU accelerated caching	true	None
spark.rapids.sql.exec.DataWritingCommandExec	Writing data	true	None
spark.rapids.sql.exec.ExecutedCommandExec	Eagerly executed commands	true	None
spark.rapids.sql.exec.BatchScanExec	The backend for most file input	true	None
spark.rapids.sql.exec.BroadcastExchangeExec	The backend for broadcast exchange of data	true	None
spark.rapids.sql.exec.ShuffleExchangeExec	The backend for most data being exchanged between processes	true	None
spark.rapids.sql.exec.BroadcastHashJoinExec	Implementation of join using broadcast data	true	None
spark.rapids.sql.exec.BroadcastNestedLoopJoinExec	Implementation of join using brute force. Full outer joins and joins where the broadcast side matches the join side (e.g.: LeftOuter with left broadcast) are not supported	true	None
spark.rapids.sql.exec.CartesianProductExec	Implementation of join using brute force	true	None
spark.rapids.sql.exec.ShuffledHashJoinExec	Implementation of join using hashed shuffled data	true	None
spark.rapids.sql.exec.SortMergeJoinExec	Sort merge join, replacing with shuffled hash join	true	None
spark.rapids.sql.exec.AggregateInPandasExec	The backend for an Aggregation Pandas UDF, this accelerates the data transfer between the Java process and the Python process. It also supports scheduling GPU resources for the Python process when enabled.	true	None
spark.rapids.sql.exec.ArrowEvalPythonExec	The backend of the Scalar Pandas UDFs. Accelerates the data transfer between the Java process and the Python process. It also supports scheduling GPU resources for the Python process when enabled	true	None
spark.rapids.sql.exec.FlatMapCoGroupsInPandasExec	The backend for CoGrouped Aggregation Pandas UDF. Accelerates the data transfer between the Java process and the Python process. It also supports scheduling GPU resources for the Python process when enabled.	false	This is disabled by default because Performance is not ideal with many small groups
spark.rapids.sql.exec.FlatMapGroupsInPandasExec	The backend for Flat Map Groups Pandas UDF, Accelerates the data transfer between the Java process and the Python process. It also supports scheduling GPU resources for the Python process when enabled.	true	None
spark.rapids.sql.exec.MapInPandasExec	The backend for Map Pandas Iterator UDF. Accelerates the data transfer between the Java process and the Python process. It also supports scheduling GPU resources for the Python process when enabled.	true	None
spark.rapids.sql.exec.WindowInPandasExec	The backend for Window Aggregation Pandas UDF, Accelerates the data transfer between the Java process and the Python process. It also supports scheduling GPU resources for the Python process when enabled. For now it only supports row based window frame.	false	This is disabled by default because it only supports row based frame for now
spark.rapids.sql.exec.WindowExec	Window-operator backend	true	None
spark.rapids.sql.exec.HiveTableScanExec	Scan Exec to read Hive delimited text tables	true	None

Commands

Name	Description	Default Value	Notes
spark.rapids.sql.command.SaveIntoDataSourceCommand	Write to a data source	true	None

Scans

Name	Description	Default Value	Notes
spark.rapids.sql.input.CSVScan	CSV parsing	true	None
spark.rapids.sql.input.JsonScan	Json parsing	true	None
spark.rapids.sql.input.OrcScan	ORC parsing	true	None
spark.rapids.sql.input.ParquetScan	Parquet parsing	true	None
spark.rapids.sql.input.AvroScan	Avro parsing	true	None

Partitioning

Name	Description	Default Value	Notes
spark.rapids.sql.partitioning.HashPartitioning	Hash based partitioning	true	None
spark.rapids.sql.partitioning.RangePartitioning	Range partitioning	true	None
spark.rapids.sql.partitioning.RoundRobinPartitioning	Round robin partitioning	true	None
spark.rapids.sql.partitioning.SinglePartition$	Single partitioning	true	None