3 Data Wrangling

3.1 Clean Data

The raw data used in much of political science research rarely comes to us in a format that is immediately accessible for analysis. Instead, we frequently need to write an R script to “clean” the data first. Only after cleaning the data will it be usable for visualization or statistical modeling. This process (also known as “data wrangling”) can be extremely arduous and time-consuming—depending on how messy the raw data is in the first place. According to the common aphorism, the time you spend writing R code to clean your data sets will far exceed the time you spend subsequently running any sort of statistical analysis on the cleaned data.

Although data cleaning has a reputation as a dull, menial task (compared to the “fun” of statistical modeling), try to avoid treating the process like it is merely a roadblock—something to be overcome before you can begin the real analysis. In fact, each decision you make when wrangling your raw data into its final state is a crucial part of the final research product. Whether or not to drop certain observations, or using different levels of data aggregation, are some examples of choices which can have massive downstream effects. You should be thoughtful and transparent about your decision-making process the whole time you spend cleaning data.

3.2 Cleaning Data Using dplyr

As in the other sections of this book, we will be using the Tidyverse approach to data cleaning here. The primary Tidyverse package for data cleaning is dplyr and it contains most of the functions we will be using in this chapter.

Rather than downloading a .csv file and using the read_csv() function to load in our data for this chapter, we will be using a package which contains multiple data sets. The data come from the United Nations General Assembly. First install the unvotes package by running the following command in your Console.

install.packages("unvotes")

Then run

library(unvotes)
library(tidyverse)

When you use the library() function, many packages automatically load small data sets which immediately become available for use. However, accessing these data can be a bit confusing because they do not automatically show up as objects in your Environment tab. Let’s add the unvotes data sets to our Environment with the following chunk of code:

un_votes <- un_votes
un_roll_calls <- un_roll_calls
un_roll_call_issues <- un_roll_call_issues

About the data

un_votes, country-vote level data. Each row is a country’s vote on a particular UN Assembly resolution.
un_roll_calls, resolution level data. Contains information about each resolution.
un_roll_call_issues, resolution level data. Contains the issue-area for each resolution.

The Pipe Operator

Before we get started using dplyr we need to first introduce the “pipe” operator |>. Pipes are an extremely convenient tool for linking several functions together. They work by passing the object on the left hand side of the pipe into the function following the pipe. Here is an example.

# The pipe way
me |> 
  wake_up(time = "8:00") |> 
  get_out_of_bed(side = "correct") |> 
  get_dressed(pants = TRUE, shirt = TRUE) |> 
  leave_house(car = FALSE, bike = TRUE)

# The non-pipe way
leave_house(get_dressed(get_out_of_bed(wake_up(me, time = "8:00"), side = "correct"), pants = TRUE, shirt = TRUE), car = FALSE, bike = TRUE)

Both code chunks above will end up doing the same thing. But the pipe method is much easier to write, and much easier for others to read and understand.

Before R version 4.1, the pipe operator was only available in the magrittr package (which remains part of the Tidyverse). The magrittr pipe is written %>% and many R users continue to use this over the native R |>. There are a few very minor differences between the two pipes, which you should feel free to ignore. We use the native R |> pipe in this book because it is generally a good idea, all else equal, to reduce your dependency on outside packages when writing code.

3.3 Working with Columns

Each column in a data set typically represents a single variable, or attribute, relating to the observation in a particular row. In this section we will look at some of dplyr’s functions for working with columns.

Select

The select() function is primarily used to remove unwanted columns from the data. Here is an example.

un_roll_calls |> 
  select(rcid, date) # Keeping only two variables

# A tibble: 6,202 × 2
    rcid date      
   <int> <date>    
 1     3 1946-01-01
 2     4 1946-01-02
 3     5 1946-01-04
 4     6 1946-01-04
 5     7 1946-01-02
 6     8 1946-01-05
 7     9 1946-02-05
 8    10 1946-02-05
 9    11 1946-02-05
10    12 1946-02-06
# … with 6,192 more rows
# ℹ Use `print(n = ...)` to see more rows

Note that the code chunk above did not alter the data set object un_roll_calls. If we wanted to take this smaller data set and use it for something else, we will need to use the assignment operator <- to save our work.

# This creates a new object called "small_un_roll_calls"
# containing only rcid and date columns
small_un_roll_calls <- un_roll_calls |> 
  select(rcid, date)

The select() function can be very versatile. Rather than typing every column name we want to keep, it is often faster to specify the columns we want to drop.

un_roll_calls |> 
  select(-session) # Keep everything except session column

# A tibble: 6,202 × 8
    rcid importantvote date       unres   amend  para short                descr
   <int>         <int> <date>     <chr>   <int> <int> <chr>                <chr>
 1     3             0 1946-01-01 R/1/66      1     0 AMENDMENTS, RULES O… "TO …
 2     4             0 1946-01-02 R/1/79      0     0 SECURITY COUNCIL EL… "TO …
 3     5             0 1946-01-04 R/1/98      0     0 VOTING PROCEDURE     "TO …
 4     6             0 1946-01-04 R/1/107     0     0 DECLARATION OF HUMA… "TO …
 5     7             0 1946-01-02 R/1/295     1     0 GENERAL ASSEMBLY EL… "TO …
 6     8             0 1946-01-05 R/1/297     1     0 ECOSOC POWERS        "TO …
 7     9             0 1946-02-05 R/1/329     0     0 POST-WAR RECONSTRUC… "TO …
 8    10             0 1946-02-05 R/1/361     1     1 U.N. MEMBERS, RELAT… "TO …
 9    11             0 1946-02-05 R/1/376     0     0 TRUSTEESHIP AMENDME… "TO …
10    12             0 1946-02-06 R/1/394     1     1 COUNCIL MEMBER TERM… "TO …
# … with 6,192 more rows
# ℹ Use `print(n = ...)` to see more rows

There are many other convenient ways to selection columns (e.g. by common names, by type of data, or by position in the data set). For a full list of ways to use select() see this link. Here is one example of selecting every column which is a “character” type.

un_roll_calls |> 
  select(where(is.character))

# A tibble: 6,202 × 3
   unres   short                              descr                             
   <chr>   <chr>                              <chr>                             
 1 R/1/66  AMENDMENTS, RULES OF PROCEDURE     "TO ADOPT A CUBAN AMENDMENT TO TH…
 2 R/1/79  SECURITY COUNCIL ELECTIONS         "TO ADOPT A USSR PROPOSAL ADJOURN…
 3 R/1/98  VOTING PROCEDURE                   "TO ADOPT THE KOREAN PROPOSAL THA…
 4 R/1/107 DECLARATION OF HUMAN RIGHTS        "TO ADOPT A CUBAN PROPOSAL (A/3-C…
 5 R/1/295 GENERAL ASSEMBLY ELECTIONS         "TO ADOPT A 6TH COMMITTEE AMENDME…
 6 R/1/297 ECOSOC POWERS                      "TO ADOPT A SECOND 6TH COMM. AMEN…
 7 R/1/329 POST-WAR RECONSTRUCTION            "TO OPEN THE DISCUSSION ON THE PO…
 8 R/1/361 U.N. MEMBERS, RELATIONS WITH SPAIN "TO ADOPT GENERAL COMM. DRAFT RES…
 9 R/1/376 TRUSTEESHIP AMENDMENTS             "TO ADOPT DRAFT RESOLUTIONS I AND…
10 R/1/394 COUNCIL MEMBER TERM LENGTH         "TO ADOPT PARAGRAPH (A) OF THE 6T…
# … with 6,192 more rows
# ℹ Use `print(n = ...)` to see more rows

Lastly, we can use select() to rename columns in our data. The chunk of code below selects the “unres” and date columns, and renames “unres” to “un_resolution” at the same time.

un_roll_calls |> 
  select(un_resolution = unres, date)

# A tibble: 6,202 × 2
   un_resolution date      
   <chr>         <date>    
 1 R/1/66        1946-01-01
 2 R/1/79        1946-01-02
 3 R/1/98        1946-01-04
 4 R/1/107       1946-01-04
 5 R/1/295       1946-01-02
 6 R/1/297       1946-01-05
 7 R/1/329       1946-02-05
 8 R/1/361       1946-02-05
 9 R/1/376       1946-02-05
10 R/1/394       1946-02-06
# … with 6,192 more rows
# ℹ Use `print(n = ...)` to see more rows

If you only want to rename columns—without specifying which to select—dplyr has a function rename() for this purpose. The syntax is similar to the code chunk above new_variable_name = original_variable_name. This code will keep all columns and change the names of the specified variables.

un_roll_calls |> 
  rename(un_resolution = unres,
         amendment = amend,
         paragraph = para)

# A tibble: 6,202 × 9
    rcid session importantvote date       un_resol…¹ amend…² parag…³ short descr
   <int>   <dbl>         <int> <date>     <chr>        <int>   <int> <chr> <chr>
 1     3       1             0 1946-01-01 R/1/66           1       0 AMEN… "TO …
 2     4       1             0 1946-01-02 R/1/79           0       0 SECU… "TO …
 3     5       1             0 1946-01-04 R/1/98           0       0 VOTI… "TO …
 4     6       1             0 1946-01-04 R/1/107          0       0 DECL… "TO …
 5     7       1             0 1946-01-02 R/1/295          1       0 GENE… "TO …
 6     8       1             0 1946-01-05 R/1/297          1       0 ECOS… "TO …
 7     9       1             0 1946-02-05 R/1/329          0       0 POST… "TO …
 8    10       1             0 1946-02-05 R/1/361          1       1 U.N.… "TO …
 9    11       1             0 1946-02-05 R/1/376          0       0 TRUS… "TO …
10    12       1             0 1946-02-06 R/1/394          1       1 COUN… "TO …
# … with 6,192 more rows, and abbreviated variable names ¹un_resolution,
#   ²amendment, ³paragraph
# ℹ Use `print(n = ...)` to see more rows

Mutate

The select() and rename() functions are great for tidying up your data sets, but they do not change the underlying variables. To change existing variables or to create new ones we use mutate().

Let’s say we discovered that all the roll call IDs in the column “rcid” were supposed to be in multiples of 10. The code chunk below creates a new variable called “rcid_10” which is simply the value of the original “rcid” variable multiplied by 10.

un_votes |> 
  mutate(rcid_10 = rcid * 10) # Creates the new rcid_10 variable

# A tibble: 869,937 × 5
    rcid country            country_code vote  rcid_10
   <dbl> <chr>              <chr>        <fct>   <dbl>
 1     3 United States      US           yes        30
 2     3 Canada             CA           no         30
 3     3 Cuba               CU           yes        30
 4     3 Haiti              HT           yes        30
 5     3 Dominican Republic DO           yes        30
 6     3 Mexico             MX           yes        30
 7     3 Guatemala          GT           yes        30
 8     3 Honduras           HN           yes        30
 9     3 El Salvador        SV           yes        30
10     3 Nicaragua          NI           yes        30
# … with 869,927 more rows
# ℹ Use `print(n = ...)` to see more rows

If you didn’t want to create a brand new variable, but instead wanted to overwrite the original variable, you just need to put the original variable to the left of the = in mutate().

un_votes |> 
  mutate(rcid = rcid * 10) # Overwrites the existing rcid variable

# A tibble: 869,937 × 4
    rcid country            country_code vote 
   <dbl> <chr>              <chr>        <fct>
 1    30 United States      US           yes  
 2    30 Canada             CA           no   
 3    30 Cuba               CU           yes  
 4    30 Haiti              HT           yes  
 5    30 Dominican Republic DO           yes  
 6    30 Mexico             MX           yes  
 7    30 Guatemala          GT           yes  
 8    30 Honduras           HN           yes  
 9    30 El Salvador        SV           yes  
10    30 Nicaragua          NI           yes  
# … with 869,927 more rows
# ℹ Use `print(n = ...)` to see more rows

We commonly need to create a new variable whose values depend on the values of one of the original variables. The function case_when() helps us do this inside mutate().

un_votes |> 
  mutate(vote_dummy = case_when(vote == "yes" ~ 1,
                                vote == "no" ~ 0))

# A tibble: 869,937 × 5
    rcid country            country_code vote  vote_dummy
   <dbl> <chr>              <chr>        <fct>      <dbl>
 1     3 United States      US           yes            1
 2     3 Canada             CA           no             0
 3     3 Cuba               CU           yes            1
 4     3 Haiti              HT           yes            1
 5     3 Dominican Republic DO           yes            1
 6     3 Mexico             MX           yes            1
 7     3 Guatemala          GT           yes            1
 8     3 Honduras           HN           yes            1
 9     3 El Salvador        SV           yes            1
10     3 Nicaragua          NI           yes            1
# … with 869,927 more rows
# ℹ Use `print(n = ...)` to see more rows

As you can see, the code chunk above creates a new variable called “vote_dummy” which takes the value 1 if “vote” equals "yes" and takes the values 0 if “vote” equals "no". Like other programming languages, R uses the == logical operator to check whether a value equals, or is equivalent, to some other value. This is different from the single = which is used to create entire new variables inside mutate().

Here is one more example using case_when().

un_votes |> 
  mutate(rcid_era = case_when(rcid < 2000 ~ "old",
                              rcid >= 2000 & rcid < 6000 ~ "middle",
                              rcid >= 6000 ~ "recent"))

# A tibble: 869,937 × 5
    rcid country            country_code vote  rcid_era
   <dbl> <chr>              <chr>        <fct> <chr>   
 1     3 United States      US           yes   old     
 2     3 Canada             CA           no    old     
 3     3 Cuba               CU           yes   old     
 4     3 Haiti              HT           yes   old     
 5     3 Dominican Republic DO           yes   old     
 6     3 Mexico             MX           yes   old     
 7     3 Guatemala          GT           yes   old     
 8     3 Honduras           HN           yes   old     
 9     3 El Salvador        SV           yes   old     
10     3 Nicaragua          NI           yes   old     
# … with 869,927 more rows
# ℹ Use `print(n = ...)` to see more rows

The new variable “rcid_era” takes three values, "old", "middle", and "recent" based on what range the “rcid” variable falls within.

To recap, the most commonly used functions for cleaning columns/variables in your data are:

select() removes and/or renames columns.
rename() renames existing columns without removing any.
mutate() changes the values of existing columns and creates new columns.

3.4 Working with Rows

Filter

Dplyr’s primary function for removing unwanted rows is filter(). Like we saw when using case_when() inside mutate(), using filter() requires some practice with logical operators. The function filter() works by specifying some variable and only keeping rows in the data for which the logical operation evaluates to TRUE. Here is an example.

un_roll_calls |> 
  filter(importantvote == 1)

# A tibble: 411 × 9
    rcid session importantvote date       unres     amend  para short      descr
   <int>   <dbl>         <int> <date>     <chr>     <int> <int> <chr>      <chr>
 1  2491      38             1 1983-10-04 R/38/3        0     0 KAMPUCHEA  TO R…
 2  2492      38             1 1983-11-06 R/38/7        0     0 GRENADA, … TO D…
 3  2497      38             1 1983-11-06 R/38/29       0     0 AFGHANIST… TO D…
 4  2504      38             1 1983-12-06 R/38/39A      0     0 APARTHEID… TO A…
 5  2510      38             1 1983-12-06 R/38/39G      0     0 SOUTH AFR… TO R…
 6  2526      38             1 1983-12-06 R/38/180E     0     0 ISRAEL, I… TO O…
 7  2563      38             1 1983-12-07 R/38/187C     0     0 CHEMICAL,… TO N…
 8  2610      38             1 1983-12-03 R/38/101      0     0 HUMAN RIG… TO E…
 9  2641      39             1 1984-10-04 R/39/5        0     0 KAMPUCHEA  TO R…
10  2645      39             1 1984-11-02 R/39/13       0     0 AFGHANIST… TO R…
# … with 401 more rows
# ℹ Use `print(n = ...)` to see more rows

The original “un_roll_calls” data set object has 6202 rows, whereas this new data set only has 411 rows. This is because we filtered out any row in which the variable “importantvote” was not equal to 1. Like select(), we can extend the use of filter() in many ways.

# & for AND
un_roll_calls |> 
  filter(importantvote == 1 & session > 40)

# A tibble: 386 × 9
    rcid session importantvote date       unres     amend  para short      descr
   <int>   <dbl>         <int> <date>     <chr>     <int> <int> <chr>      <chr>
 1  2948      41             1 1986-10-03 R/41/6       NA    NA KAMPUCHEA  The …
 2  2957      41             1 1986-11-02 R/41/31      NA    NA NICARAGUA… Urge…
 3  2958      41             1 1986-11-04 R/41/33      NA    NA AFGHANIST… The …
 4  2967      41             1 1986-11-05 R/41/38      NA    NA LIBYA, U.… Decl…
 5  2968      41             1 1986-11-05 R/41/39A     NA    NA NAMIBIA, … Situ…
 6  2996      41             1 1986-12-04 R/41/58C     NA    NA CHEMICAL,… Proh…
 7  3075      41             1 1986-12-05 R/41/158     NA    NA HUMAN RIG… Ques…
 8  3078      41             1 1986-12-05 R/41/162A    NA    NA MIDDLE EA… Reaf…
 9  3098      41             1 1986-12-05 R/41/211A    NA    NA BUDGET, 1… Revi…
10  3099      41             1 1986-12-05 R/41/211B    NA    NA BUDGET, 1… Revi…
# … with 376 more rows
# ℹ Use `print(n = ...)` to see more rows

# | for OR
un_roll_calls |> 
  filter(importantvote == 1 | session > 40)

# A tibble: 3,228 × 9
    rcid session importantvote date       unres     amend  para short      descr
   <int>   <dbl>         <int> <date>     <chr>     <int> <int> <chr>      <chr>
 1  2491      38             1 1983-10-04 R/38/3        0     0 KAMPUCHEA  TO R…
 2  2492      38             1 1983-11-06 R/38/7        0     0 GRENADA, … TO D…
 3  2497      38             1 1983-11-06 R/38/29       0     0 AFGHANIST… TO D…
 4  2504      38             1 1983-12-06 R/38/39A      0     0 APARTHEID… TO A…
 5  2510      38             1 1983-12-06 R/38/39G      0     0 SOUTH AFR… TO R…
 6  2526      38             1 1983-12-06 R/38/180E     0     0 ISRAEL, I… TO O…
 7  2563      38             1 1983-12-07 R/38/187C     0     0 CHEMICAL,… TO N…
 8  2610      38             1 1983-12-03 R/38/101      0     0 HUMAN RIG… TO E…
 9  2641      39             1 1984-10-04 R/39/5        0     0 KAMPUCHEA  TO R…
10  2645      39             1 1984-11-02 R/39/13       0     0 AFGHANIST… TO R…
# … with 3,218 more rows
# ℹ Use `print(n = ...)` to see more rows

The first chunk above filters out all rows in which “importantvote” did not equal 1 AND “session” was less than 40. The second chunk above filters out all rows in which “importantvote” did not equal 1 OR “session” was less than 40.

Removing missing, or NA, values is another common job for filter(). To do this you can use the function is.na() inside filter(). The function is.na() evaluates to TRUE if the variable’s value is NA and evaluates to FALSE otherwise. Because filter() only keeps rows where the condition evaluates to TRUE, in order to remove NA values we need to negate is.na() with the ! operator. The ! operator flips the truthiness of any logical statement it precedes.

un_roll_calls |> 
  filter(!is.na(amend)) # Removing rows with NA for the amend variable

# A tibble: 2,868 × 9
    rcid session importantvote date       unres   amend  para short        descr
   <int>   <dbl>         <int> <date>     <chr>   <int> <int> <chr>        <chr>
 1     3       1             0 1946-01-01 R/1/66      1     0 AMENDMENTS,… "TO …
 2     4       1             0 1946-01-02 R/1/79      0     0 SECURITY CO… "TO …
 3     5       1             0 1946-01-04 R/1/98      0     0 VOTING PROC… "TO …
 4     6       1             0 1946-01-04 R/1/107     0     0 DECLARATION… "TO …
 5     7       1             0 1946-01-02 R/1/295     1     0 GENERAL ASS… "TO …
 6     8       1             0 1946-01-05 R/1/297     1     0 ECOSOC POWE… "TO …
 7     9       1             0 1946-02-05 R/1/329     0     0 POST-WAR RE… "TO …
 8    10       1             0 1946-02-05 R/1/361     1     1 U.N. MEMBER… "TO …
 9    11       1             0 1946-02-05 R/1/376     0     0 TRUSTEESHIP… "TO …
10    12       1             0 1946-02-06 R/1/394     1     1 COUNCIL MEM… "TO …
# … with 2,858 more rows
# ℹ Use `print(n = ...)` to see more rows

It can be confusing to think in terms of logical statements, especially when negation is involved! But you will become more comfortable with filter() the more you practice.

Another handy logical operator is %in%. This lets you specify several values at once when checking whether to keep rows in filter().

# Cumbersome way
un_votes |> 
  filter(country == "Kenya" |
         country == "Grenada" |
         country == "Canada" |
         country == "Latvia" |
         country == "Yemen" |
         country == "Angola")

# A tibble: 23,324 × 4
    rcid country country_code vote 
   <dbl> <chr>   <chr>        <fct>
 1     3 Canada  CA           no   
 2     4 Canada  CA           no   
 3     5 Canada  CA           no   
 4     6 Canada  CA           no   
 5     7 Canada  CA           no   
 6     8 Canada  CA           yes  
 7     9 Canada  CA           yes  
 8    10 Canada  CA           yes  
 9    11 Canada  CA           yes  
10    12 Canada  CA           yes  
# … with 23,314 more rows
# ℹ Use `print(n = ...)` to see more rows

# Easier way
country_list <- c("Kenya", "Grenada", "Canada",
                  "Latvia", "Yemen", "Angola")
un_votes |> 
  filter(country %in% country_list)

# A tibble: 23,324 × 4
    rcid country country_code vote 
   <dbl> <chr>   <chr>        <fct>
 1     3 Canada  CA           no   
 2     4 Canada  CA           no   
 3     5 Canada  CA           no   
 4     6 Canada  CA           no   
 5     7 Canada  CA           no   
 6     8 Canada  CA           yes  
 7     9 Canada  CA           yes  
 8    10 Canada  CA           yes  
 9    11 Canada  CA           yes  
10    12 Canada  CA           yes  
# … with 23,314 more rows
# ℹ Use `print(n = ...)` to see more rows

The two chunks of code above will produce the same filtered data set, but the second chunk requires fewer lines of code and is more flexible if we want to add or remove countries.

Aggregation

Filtering is great for removing unwanted rows in your data. However, after using filter() the data set’s unit of analysis typically remains the same. In the code chunk above, for example, we removed most of the countries from our data, but each row in the resulting data set is still a unique country-vote observation.

Let’s say we want to reduce the data set down to the country level, rather than country-vote level, and examine the proportion of “no” votes taken by each country. We can do this in dplyr using the pair of functions group_by() and summarize(). Here is an example.

un_votes |> 
  # Creating a new numeric dummy vote variable
  mutate(vote_dummy = case_when(vote == "yes" ~ 1,
                                vote == "no" ~ 0)) |> 
  # Specify level of aggregation
  group_by(country) |> 
  # Perform the aggregation function by group
  summarize(proportion_yes_vote = mean(vote_dummy, na.rm = TRUE))

# A tibble: 200 × 2
   country           proportion_yes_vote
   <chr>                           <dbl>
 1 Afghanistan                     0.938
 2 Albania                         0.816
 3 Algeria                         0.965
 4 Andorra                         0.833
 5 Angola                          0.985
 6 Antigua & Barbuda               0.988
 7 Argentina                       0.935
 8 Armenia                         0.950
 9 Australia                       0.745
10 Austria                         0.880
# … with 190 more rows
# ℹ Use `print(n = ...)` to see more rows

The code above uses the function mean() to calculate the average value of “vote_dummy” within each country. Because “vote_dummy” takes the values 1 and 0, this average can be interpreted as a proportion of 1’s, or “yes” votes. We need to add na.rm = TRUE inside mean() to tell R to ignore NA values when calculating this average.

If we want to sort our data by the new aggregated column, we can do so using arrange() after summarize().

un_votes |> 
  mutate(vote_dummy = case_when(vote == "yes" ~ 1,
                                vote == "no" ~ 0)) |> 
  group_by(country) |> 
  summarize(proportion_yes_vote = mean(vote_dummy, na.rm = TRUE)) |> 
  arrange(proportion_yes_vote)

# A tibble: 200 × 2
   country                          proportion_yes_vote
   <chr>                                          <dbl>
 1 United States                                  0.369
 2 Israel                                         0.464
 3 Palau                                          0.541
 4 Micronesia (Federated States of)               0.604
 5 United Kingdom                                 0.616
 6 Marshall Islands                               0.659
 7 France                                         0.662
 8 Federal Republic of Germany                    0.682
 9 Canada                                         0.701
10 Belgium                                        0.714
# … with 190 more rows
# ℹ Use `print(n = ...)` to see more rows

# Use arrange(desc(proportion_yes_vote))
# to sort in descending order

The United States votes “No” at a much higher rate than other countries!

Lastly, the dplyr function count() is very handy for quickly tallying the number of observations within each category of a variable. Simply put the name of the variable you want to count inside count() and voila.

un_roll_call_issues |> 
  count(issue)

# A tibble: 6 × 2
  issue                                    n
  <fct>                                <int>
1 Colonialism                            957
2 Arms control and disarmament          1092
3 Economic development                   765
4 Human rights                          1015
5 Palestinian conflict                  1061
6 Nuclear weapons and nuclear material   855

3.5 Merging Data

So far we have only been working with one data set at a time. But it is rare to find all the data you need for a particular project in one single data set. Frequently we will have to merge data sets together in order to have all our variables in the same place. The most common way to merge data using dplyr is with the left_join() function.

To illustrate how left_join() works, let’s say we have two data sets: our main data set, x and the data set we want to merge, y. Here is an example:

xy_data <- left_join(x, y, by = "key_variable")

The new data set xy_data is produced by merging the y data set into the x data set based on the shared value of the key_variable variable. Left joins will always keep all rows in the x data set, but will only merge in rows from y if they match up with the key_variable. Your key variable will typically be something like country ID code or a unit’s name in the data. Because we like to use pipes in this course, the chunk of code above can be rewritten as:

xy_data <- x |> 
  left_join(y, by = "key_variable")

Now let’s take a look at left_join() using real data from the unvotes data set.

un_votes |> 
  left_join(un_roll_calls, by = "rcid")

# A tibble: 869,937 × 12
    rcid country      count…¹ vote  session impor…² date       unres amend  para
   <dbl> <chr>        <chr>   <fct>   <dbl>   <int> <date>     <chr> <int> <int>
 1     3 United Stat… US      yes         1       0 1946-01-01 R/1/…     1     0
 2     3 Canada       CA      no          1       0 1946-01-01 R/1/…     1     0
 3     3 Cuba         CU      yes         1       0 1946-01-01 R/1/…     1     0
 4     3 Haiti        HT      yes         1       0 1946-01-01 R/1/…     1     0
 5     3 Dominican R… DO      yes         1       0 1946-01-01 R/1/…     1     0
 6     3 Mexico       MX      yes         1       0 1946-01-01 R/1/…     1     0
 7     3 Guatemala    GT      yes         1       0 1946-01-01 R/1/…     1     0
 8     3 Honduras     HN      yes         1       0 1946-01-01 R/1/…     1     0
 9     3 El Salvador  SV      yes         1       0 1946-01-01 R/1/…     1     0
10     3 Nicaragua    NI      yes         1       0 1946-01-01 R/1/…     1     0
# … with 869,927 more rows, 2 more variables: short <chr>, descr <chr>, and
#   abbreviated variable names ¹country_code, ²importantvote
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

The code chunk above merged all the variables from the un_roll_calls data set into the un_votes data set. Recall that the un_votes data contains information about how countries voted on particular UN roll call votes, and the un_roll_calls data contains information about the specific votes (such as whether the vote was an “importantvote”). If there were any rows in un_roll_calls that had rcid values which failed to match with an rcid value in un_votes, the row would not get merged.

While left_join() is far and away the most common, and useful, way to merge two data sets when using dplyr, there are various other types of merges which have their niche applications. For a more in-depth look at merging data in dplyr see this page with helpful animations.