3 Methods to Remove Data from the HDFS VAPUBLIC Directory

When working with HDFS, you often load data into the various directories so it can be uploaded into SAS Visual Analytics. Sometimes, however, you need to remove that data. Here are three different ways to remove it.

Method 1: Use the SAS VA Administrator Tools

If you have access to the Manage Environment area, you can use the Explore HDFS tab to interact with the data in HDFS. Select the data set name and then click the Trashcan icon to delete the data.


If you are using Cloudera or Hortonworks, you can use their tools to manage HDFS. Here’s how you would do it from the Hue File Browser in Cloudera. Many SAS VA users find it easier to work with Hadoop using one of these commercial distributions.


Method 2: Use the Command Line

The Hadoop file system shell is modeled on familiar Linux commands, so many of the commands you use in Linux you can also use with Hadoop to control HDFS. For instance, if you want to list the files in a directory, you use the ls command. It works the same way for HDFS. In this example, I do the following things:

  1. List the contents of the vapublic subdirectory.
  2. Copy hps/c_orders_main to the vapublic directory.
  3. List the contents of the vapublic subdirectory.
  4. Delete the vapublic/c_orders_main data set.


What’s different about these commands is that you type “hdfs dfs -[command] <path>” instead of just the command, and full path names are used. Here are the steps above repeated as HDFS commands.

  1. hdfs dfs -ls /vapublic
  2. hdfs dfs -cp /hps/c_orders_main.sashdat    /vapublic
  3. hdfs dfs -ls /vapublic
  4. hdfs dfs -rm /vapublic/c_orders_main.sashdat

You can find a reference for these commands at the Apache Software Foundation Hadoop site.
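The four steps above can be collected into a small script. This is a sketch in dry-run form: with RUN set to echo, it only prints each hdfs command so you can review them first; set RUN to an empty string to execute for real (this assumes the hdfs client is on your PATH and you have write access to these directories).

```shell
#!/bin/sh
# Dry-run wrapper around the four HDFS steps above.
# RUN="echo" prints each command instead of executing it;
# set RUN="" to actually run them (requires the hdfs client).
RUN="echo"

$RUN hdfs dfs -ls /vapublic
$RUN hdfs dfs -cp /hps/c_orders_main.sashdat /vapublic
$RUN hdfs dfs -ls /vapublic
$RUN hdfs dfs -rm /vapublic/c_orders_main.sashdat
```

When run for real, hdfs dfs -rm moves the file to the HDFS trash if the trash feature is enabled on your cluster; add the -skipTrash option to delete it immediately.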

Method 3: Use SAS Studio

If you have already been using code to upload the data, then you can use code to delete the data. To make it even easier, you can delete it using PROC DATASETS, just as you would with any SAS library. Refer to the SAS documentation for more details about the DATASETS procedure.

In this example, I assign the VAPUBLIC directory to a library using the LIBNAME statement. The SASHDAT engine allows SAS to distribute the data set into blocks across the data nodes. Otherwise, this works like any other SAS library.

/* Assign the library */
 libname myHDFS SASHDAT
          host="myserver.com"  install="/opt/sas/TKGrid"
          path="/vapublic";

 /* Delete the file from the library */
 proc datasets lib=myhdfs nodetails nolist;
      delete c_orders_main;
 run;
 quit;

Here’s the log from the code above. You can see that the library was assigned and then the data set was deleted. Unlike with the command-line method, you list only the data set name, not the file extension.

59 libname myHDFS SASHDAT
60 host="myserver.com"
61 install="/opt/sas/TKGrid"
62 path="/vapublic";
NOTE: Libref MYHDFS was successfully assigned as follows:
Physical Name: Directory '/vapublic' of HDFS cluster on host 'myserver.com'

70 ! proc datasets lib=myhdfs nodetails nolist;
71 delete c_orders_main;
72 run;

NOTE: Deleting MYHDFS.C_ORDERS_MAIN (memtype=DATA).

If you do not know the host or TKGrid installation path, you can discover it by asking the SAS administrator. Another method is to check the code written by the Data Builder: create a temporary query that saves to the HDFS library, then review the generated code.
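The generated code includes a LIBNAME statement containing the values you need. It looks something like the following sketch; the host and install values shown here are placeholders from the example above, and yours will differ.

```sas
libname myHDFS SASHDAT
   host="myserver.com"        /* placeholder: your head node host      */
   install="/opt/sas/TKGrid"  /* placeholder: your TKGrid install path */
   path="/vapublic";
```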


These examples were created with the distributed SAS Visual Analytics 7.3.
