install.packages("unvotes")
3 Data Wrangling
3.1 Clean Data
The raw data used in much of political science research rarely comes to us in a format that is immediately accessible for analysis. Instead, we frequently need to write an R script to “clean” the data first. Only after cleaning the data will it be usable for visualization or statistical modeling. This process (also known as “data wrangling”) can be extremely arduous and time-consuming—depending on how messy the raw data is in the first place. According to the common aphorism, the time you spend writing R code to clean your data sets will far exceed the time you spend subsequently running any sort of statistical analysis on the cleaned data.
Although data cleaning has a reputation as a dull, menial task (compared to the “fun” of statistical modeling), try to avoid treating the process like it is merely a roadblock—something to be overcome before you can begin the real analysis. In fact, each decision you make when wrangling your raw data into its final state is a crucial part of the final research product. Whether or not to drop certain observations, or using different levels of data aggregation, are some examples of choices which can have massive downstream effects. You should be thoughtful and transparent about your decision-making process the whole time you spend cleaning data.
3.2 Cleaning Data Using dplyr
As in the other sections of this book, we will be using the Tidyverse approach to data cleaning here. The primary Tidyverse package for data cleaning is dplyr and it contains most of the functions we will be using in this chapter.
Rather than downloading a .csv file and using the read_csv()
function to load in our data for this chapter, we will be using a package which contains multiple data sets. The data come from the United Nations General Assembly. First install the unvotes package by running the following command in your Console.
Then run
When you use the library()
function, many packages automatically load small data sets which immediately become available for use. However, accessing these data can be a bit confusing because they do not automatically show up as objects in your Environment tab. Let’s add the unvotes data sets to our Environment with the following chunk of code:
un_votes <- un_votes
un_roll_calls <- un_roll_calls
un_roll_call_issues <- un_roll_call_issues
About the data
-
un_votes
, country-vote level data. Each row is a country’s vote on a particular UN Assembly resolution. -
un_roll_calls
, resolution level data. Contains information about each resolution. -
un_roll_call_issues
, resolution level data. Contains the issue-area for each resolution.
The Pipe Operator
Before we get started using dplyr we need to first introduce the “pipe” operator |>
. Pipes are an extremely convenient tool for linking several functions together. They work by passing the object on the left hand side of the pipe into the function following the pipe. Here is an example.
# The pipe way
me |>
wake_up(time = "8:00") |>
get_out_of_bed(side = "correct") |>
get_dressed(pants = TRUE, shirt = TRUE) |>
leave_house(car = FALSE, bike = TRUE)
# The non-pipe way
leave_house(get_dressed(get_out_of_bed(wake_up(me, time = "8:00"), side = "correct"), pants = TRUE, shirt = TRUE), car = FALSE, bike = TRUE)
Both code chunks above will end up doing the same thing. But the pipe method is much easier to write, and much easier for others to read and understand.
Before R version 4.1, the pipe operator was only available in the magrittr package (which remains part of the Tidyverse). The magrittr pipe is written %>%
and many R users continue to use this over the native R |>
. There are a few very minor differences between the two pipes, which you should feel free to ignore. We use the native R |>
pipe in this book because it is generally a good idea, all else equal, to reduce your dependency on outside packages when writing code.
3.3 Working with Columns
Each column in a data set typically represents a single variable, or attribute, relating to the observation in a particular row. In this section we will look at some of dplyr’s functions for working with columns.
Select
The select()
function is primarily used to remove unwanted columns from the data. Here is an example.
un_roll_calls |>
select(rcid, date) # Keeping only two variables
# A tibble: 6,202 × 2
rcid date
<int> <date>
1 3 1946-01-01
2 4 1946-01-02
3 5 1946-01-04
4 6 1946-01-04
5 7 1946-01-02
6 8 1946-01-05
7 9 1946-02-05
8 10 1946-02-05
9 11 1946-02-05
10 12 1946-02-06
# … with 6,192 more rows
# ℹ Use `print(n = ...)` to see more rows
Note that the code chunk above did not alter the data set object un_roll_calls
. If we wanted to take this smaller data set and use it for something else, we will need to use the assignment operator <-
to save our work.
# This creates a new object called "small_un_roll_calls"
# containing only rcid and date columns
small_un_roll_calls <- un_roll_calls |>
select(rcid, date)
The select()
function can be very versatile. Rather than typing every column name we want to keep, it is often faster to specify the columns we want to drop.
un_roll_calls |>
select(-session) # Keep everything except session column
# A tibble: 6,202 × 8
rcid importantvote date unres amend para short descr
<int> <int> <date> <chr> <int> <int> <chr> <chr>
1 3 0 1946-01-01 R/1/66 1 0 AMENDMENTS, RULES O… "TO …
2 4 0 1946-01-02 R/1/79 0 0 SECURITY COUNCIL EL… "TO …
3 5 0 1946-01-04 R/1/98 0 0 VOTING PROCEDURE "TO …
4 6 0 1946-01-04 R/1/107 0 0 DECLARATION OF HUMA… "TO …
5 7 0 1946-01-02 R/1/295 1 0 GENERAL ASSEMBLY EL… "TO …
6 8 0 1946-01-05 R/1/297 1 0 ECOSOC POWERS "TO …
7 9 0 1946-02-05 R/1/329 0 0 POST-WAR RECONSTRUC… "TO …
8 10 0 1946-02-05 R/1/361 1 1 U.N. MEMBERS, RELAT… "TO …
9 11 0 1946-02-05 R/1/376 0 0 TRUSTEESHIP AMENDME… "TO …
10 12 0 1946-02-06 R/1/394 1 1 COUNCIL MEMBER TERM… "TO …
# … with 6,192 more rows
# ℹ Use `print(n = ...)` to see more rows
There are many other convenient ways to selection columns (e.g. by common names, by type of data, or by position in the data set). For a full list of ways to use select()
see this link. Here is one example of selecting every column which is a “character” type.
un_roll_calls |>
select(where(is.character))
# A tibble: 6,202 × 3
unres short descr
<chr> <chr> <chr>
1 R/1/66 AMENDMENTS, RULES OF PROCEDURE "TO ADOPT A CUBAN AMENDMENT TO TH…
2 R/1/79 SECURITY COUNCIL ELECTIONS "TO ADOPT A USSR PROPOSAL ADJOURN…
3 R/1/98 VOTING PROCEDURE "TO ADOPT THE KOREAN PROPOSAL THA…
4 R/1/107 DECLARATION OF HUMAN RIGHTS "TO ADOPT A CUBAN PROPOSAL (A/3-C…
5 R/1/295 GENERAL ASSEMBLY ELECTIONS "TO ADOPT A 6TH COMMITTEE AMENDME…
6 R/1/297 ECOSOC POWERS "TO ADOPT A SECOND 6TH COMM. AMEN…
7 R/1/329 POST-WAR RECONSTRUCTION "TO OPEN THE DISCUSSION ON THE PO…
8 R/1/361 U.N. MEMBERS, RELATIONS WITH SPAIN "TO ADOPT GENERAL COMM. DRAFT RES…
9 R/1/376 TRUSTEESHIP AMENDMENTS "TO ADOPT DRAFT RESOLUTIONS I AND…
10 R/1/394 COUNCIL MEMBER TERM LENGTH "TO ADOPT PARAGRAPH (A) OF THE 6T…
# … with 6,192 more rows
# ℹ Use `print(n = ...)` to see more rows
Lastly, we can use select()
to rename columns in our data. The chunk of code below selects the “unres” and date columns, and renames “unres” to “un_resolution” at the same time.
un_roll_calls |>
select(un_resolution = unres, date)
# A tibble: 6,202 × 2
un_resolution date
<chr> <date>
1 R/1/66 1946-01-01
2 R/1/79 1946-01-02
3 R/1/98 1946-01-04
4 R/1/107 1946-01-04
5 R/1/295 1946-01-02
6 R/1/297 1946-01-05
7 R/1/329 1946-02-05
8 R/1/361 1946-02-05
9 R/1/376 1946-02-05
10 R/1/394 1946-02-06
# … with 6,192 more rows
# ℹ Use `print(n = ...)` to see more rows
If you only want to rename columns—without specifying which to select—dplyr has a function rename()
for this purpose. The syntax is similar to the code chunk above new_variable_name = original_variable_name
. This code will keep all columns and change the names of the specified variables.
un_roll_calls |>
rename(un_resolution = unres,
amendment = amend,
paragraph = para)
# A tibble: 6,202 × 9
rcid session importantvote date un_resol…¹ amend…² parag…³ short descr
<int> <dbl> <int> <date> <chr> <int> <int> <chr> <chr>
1 3 1 0 1946-01-01 R/1/66 1 0 AMEN… "TO …
2 4 1 0 1946-01-02 R/1/79 0 0 SECU… "TO …
3 5 1 0 1946-01-04 R/1/98 0 0 VOTI… "TO …
4 6 1 0 1946-01-04 R/1/107 0 0 DECL… "TO …
5 7 1 0 1946-01-02 R/1/295 1 0 GENE… "TO …
6 8 1 0 1946-01-05 R/1/297 1 0 ECOS… "TO …
7 9 1 0 1946-02-05 R/1/329 0 0 POST… "TO …
8 10 1 0 1946-02-05 R/1/361 1 1 U.N.… "TO …
9 11 1 0 1946-02-05 R/1/376 0 0 TRUS… "TO …
10 12 1 0 1946-02-06 R/1/394 1 1 COUN… "TO …
# … with 6,192 more rows, and abbreviated variable names ¹un_resolution,
# ²amendment, ³paragraph
# ℹ Use `print(n = ...)` to see more rows
Mutate
The select()
and rename()
functions are great for tidying up your data sets, but they do not change the underlying variables. To change existing variables or to create new ones we use mutate()
.
Let’s say we discovered that all the roll call IDs in the column “rcid” were supposed to be in multiples of 10. The code chunk below creates a new variable called “rcid_10” which is simply the value of the original “rcid” variable multiplied by 10.
un_votes |>
mutate(rcid_10 = rcid * 10) # Creates the new rcid_10 variable
# A tibble: 869,937 × 5
rcid country country_code vote rcid_10
<dbl> <chr> <chr> <fct> <dbl>
1 3 United States US yes 30
2 3 Canada CA no 30
3 3 Cuba CU yes 30
4 3 Haiti HT yes 30
5 3 Dominican Republic DO yes 30
6 3 Mexico MX yes 30
7 3 Guatemala GT yes 30
8 3 Honduras HN yes 30
9 3 El Salvador SV yes 30
10 3 Nicaragua NI yes 30
# … with 869,927 more rows
# ℹ Use `print(n = ...)` to see more rows
If you didn’t want to create a brand new variable, but instead wanted to overwrite the original variable, you just need to put the original variable to the left of the =
in mutate()
.
un_votes |>
mutate(rcid = rcid * 10) # Overwrites the existing rcid variable
# A tibble: 869,937 × 4
rcid country country_code vote
<dbl> <chr> <chr> <fct>
1 30 United States US yes
2 30 Canada CA no
3 30 Cuba CU yes
4 30 Haiti HT yes
5 30 Dominican Republic DO yes
6 30 Mexico MX yes
7 30 Guatemala GT yes
8 30 Honduras HN yes
9 30 El Salvador SV yes
10 30 Nicaragua NI yes
# … with 869,927 more rows
# ℹ Use `print(n = ...)` to see more rows
We commonly need to create a new variable whose values depend on the values of one of the original variables. The function case_when()
helps us do this inside mutate()
.
# A tibble: 869,937 × 5
rcid country country_code vote vote_dummy
<dbl> <chr> <chr> <fct> <dbl>
1 3 United States US yes 1
2 3 Canada CA no 0
3 3 Cuba CU yes 1
4 3 Haiti HT yes 1
5 3 Dominican Republic DO yes 1
6 3 Mexico MX yes 1
7 3 Guatemala GT yes 1
8 3 Honduras HN yes 1
9 3 El Salvador SV yes 1
10 3 Nicaragua NI yes 1
# … with 869,927 more rows
# ℹ Use `print(n = ...)` to see more rows
As you can see, the code chunk above creates a new variable called “vote_dummy” which takes the value 1
if “vote” equals "yes"
and takes the values 0
if “vote” equals "no"
. Like other programming languages, R uses the ==
logical operator to check whether a value equals, or is equivalent, to some other value. This is different from the single =
which is used to create entire new variables inside mutate()
.
Here is one more example using case_when()
.
un_votes |>
mutate(rcid_era = case_when(rcid < 2000 ~ "old",
rcid >= 2000 & rcid < 6000 ~ "middle",
rcid >= 6000 ~ "recent"))
# A tibble: 869,937 × 5
rcid country country_code vote rcid_era
<dbl> <chr> <chr> <fct> <chr>
1 3 United States US yes old
2 3 Canada CA no old
3 3 Cuba CU yes old
4 3 Haiti HT yes old
5 3 Dominican Republic DO yes old
6 3 Mexico MX yes old
7 3 Guatemala GT yes old
8 3 Honduras HN yes old
9 3 El Salvador SV yes old
10 3 Nicaragua NI yes old
# … with 869,927 more rows
# ℹ Use `print(n = ...)` to see more rows
The new variable “rcid_era” takes three values, "old"
, "middle"
, and "recent"
based on what range the “rcid” variable falls within.
To recap, the most commonly used functions for cleaning columns/variables in your data are:
3.4 Working with Rows
Filter
Dplyr’s primary function for removing unwanted rows is filter()
. Like we saw when using case_when()
inside mutate()
, using filter()
requires some practice with logical operators. The function filter()
works by specifying some variable and only keeping rows in the data for which the logical operation evaluates to TRUE
. Here is an example.
un_roll_calls |>
filter(importantvote == 1)
# A tibble: 411 × 9
rcid session importantvote date unres amend para short descr
<int> <dbl> <int> <date> <chr> <int> <int> <chr> <chr>
1 2491 38 1 1983-10-04 R/38/3 0 0 KAMPUCHEA TO R…
2 2492 38 1 1983-11-06 R/38/7 0 0 GRENADA, … TO D…
3 2497 38 1 1983-11-06 R/38/29 0 0 AFGHANIST… TO D…
4 2504 38 1 1983-12-06 R/38/39A 0 0 APARTHEID… TO A…
5 2510 38 1 1983-12-06 R/38/39G 0 0 SOUTH AFR… TO R…
6 2526 38 1 1983-12-06 R/38/180E 0 0 ISRAEL, I… TO O…
7 2563 38 1 1983-12-07 R/38/187C 0 0 CHEMICAL,… TO N…
8 2610 38 1 1983-12-03 R/38/101 0 0 HUMAN RIG… TO E…
9 2641 39 1 1984-10-04 R/39/5 0 0 KAMPUCHEA TO R…
10 2645 39 1 1984-11-02 R/39/13 0 0 AFGHANIST… TO R…
# … with 401 more rows
# ℹ Use `print(n = ...)` to see more rows
The original “un_roll_calls” data set object has 6202 rows, whereas this new data set only has 411 rows. This is because we filtered out any row in which the variable “importantvote” was not equal to 1
. Like select()
, we can extend the use of filter()
in many ways.
# & for AND
un_roll_calls |>
filter(importantvote == 1 & session > 40)
# A tibble: 386 × 9
rcid session importantvote date unres amend para short descr
<int> <dbl> <int> <date> <chr> <int> <int> <chr> <chr>
1 2948 41 1 1986-10-03 R/41/6 NA NA KAMPUCHEA The …
2 2957 41 1 1986-11-02 R/41/31 NA NA NICARAGUA… Urge…
3 2958 41 1 1986-11-04 R/41/33 NA NA AFGHANIST… The …
4 2967 41 1 1986-11-05 R/41/38 NA NA LIBYA, U.… Decl…
5 2968 41 1 1986-11-05 R/41/39A NA NA NAMIBIA, … Situ…
6 2996 41 1 1986-12-04 R/41/58C NA NA CHEMICAL,… Proh…
7 3075 41 1 1986-12-05 R/41/158 NA NA HUMAN RIG… Ques…
8 3078 41 1 1986-12-05 R/41/162A NA NA MIDDLE EA… Reaf…
9 3098 41 1 1986-12-05 R/41/211A NA NA BUDGET, 1… Revi…
10 3099 41 1 1986-12-05 R/41/211B NA NA BUDGET, 1… Revi…
# … with 376 more rows
# ℹ Use `print(n = ...)` to see more rows
# | for OR
un_roll_calls |>
filter(importantvote == 1 | session > 40)
# A tibble: 3,228 × 9
rcid session importantvote date unres amend para short descr
<int> <dbl> <int> <date> <chr> <int> <int> <chr> <chr>
1 2491 38 1 1983-10-04 R/38/3 0 0 KAMPUCHEA TO R…
2 2492 38 1 1983-11-06 R/38/7 0 0 GRENADA, … TO D…
3 2497 38 1 1983-11-06 R/38/29 0 0 AFGHANIST… TO D…
4 2504 38 1 1983-12-06 R/38/39A 0 0 APARTHEID… TO A…
5 2510 38 1 1983-12-06 R/38/39G 0 0 SOUTH AFR… TO R…
6 2526 38 1 1983-12-06 R/38/180E 0 0 ISRAEL, I… TO O…
7 2563 38 1 1983-12-07 R/38/187C 0 0 CHEMICAL,… TO N…
8 2610 38 1 1983-12-03 R/38/101 0 0 HUMAN RIG… TO E…
9 2641 39 1 1984-10-04 R/39/5 0 0 KAMPUCHEA TO R…
10 2645 39 1 1984-11-02 R/39/13 0 0 AFGHANIST… TO R…
# … with 3,218 more rows
# ℹ Use `print(n = ...)` to see more rows
The first chunk above filters out all rows in which “importantvote” did not equal 1
AND “session” was less than 40
. The second chunk above filters out all rows in which “importantvote” did not equal 1
OR “session” was less than 40
.
Removing missing, or NA
, values is another common job for filter()
. To do this you can use the function is.na()
inside filter()
. The function is.na()
evaluates to TRUE
if the variable’s value is NA
and evaluates to FALSE
otherwise. Because filter()
only keeps rows where the condition evaluates to TRUE
, in order to remove NA
values we need to negate is.na()
with the !
operator. The !
operator flips the truthiness of any logical statement it precedes.
# A tibble: 2,868 × 9
rcid session importantvote date unres amend para short descr
<int> <dbl> <int> <date> <chr> <int> <int> <chr> <chr>
1 3 1 0 1946-01-01 R/1/66 1 0 AMENDMENTS,… "TO …
2 4 1 0 1946-01-02 R/1/79 0 0 SECURITY CO… "TO …
3 5 1 0 1946-01-04 R/1/98 0 0 VOTING PROC… "TO …
4 6 1 0 1946-01-04 R/1/107 0 0 DECLARATION… "TO …
5 7 1 0 1946-01-02 R/1/295 1 0 GENERAL ASS… "TO …
6 8 1 0 1946-01-05 R/1/297 1 0 ECOSOC POWE… "TO …
7 9 1 0 1946-02-05 R/1/329 0 0 POST-WAR RE… "TO …
8 10 1 0 1946-02-05 R/1/361 1 1 U.N. MEMBER… "TO …
9 11 1 0 1946-02-05 R/1/376 0 0 TRUSTEESHIP… "TO …
10 12 1 0 1946-02-06 R/1/394 1 1 COUNCIL MEM… "TO …
# … with 2,858 more rows
# ℹ Use `print(n = ...)` to see more rows
It can be confusing to think in terms of logical statements, especially when negation is involved! But you will become more comfortable with filter()
the more you practice.
Another handy logical operator is %in%
. This lets you specify several values at once when checking whether to keep rows in filter()
.
# Cumbersome way
un_votes |>
filter(country == "Kenya" |
country == "Grenada" |
country == "Canada" |
country == "Latvia" |
country == "Yemen" |
country == "Angola")
# A tibble: 23,324 × 4
rcid country country_code vote
<dbl> <chr> <chr> <fct>
1 3 Canada CA no
2 4 Canada CA no
3 5 Canada CA no
4 6 Canada CA no
5 7 Canada CA no
6 8 Canada CA yes
7 9 Canada CA yes
8 10 Canada CA yes
9 11 Canada CA yes
10 12 Canada CA yes
# … with 23,314 more rows
# ℹ Use `print(n = ...)` to see more rows
# Easier way
country_list <- c("Kenya", "Grenada", "Canada",
"Latvia", "Yemen", "Angola")
un_votes |>
filter(country %in% country_list)
# A tibble: 23,324 × 4
rcid country country_code vote
<dbl> <chr> <chr> <fct>
1 3 Canada CA no
2 4 Canada CA no
3 5 Canada CA no
4 6 Canada CA no
5 7 Canada CA no
6 8 Canada CA yes
7 9 Canada CA yes
8 10 Canada CA yes
9 11 Canada CA yes
10 12 Canada CA yes
# … with 23,314 more rows
# ℹ Use `print(n = ...)` to see more rows
The two chunks of code above will produce the same filtered data set, but the second chunk requires fewer lines of code and is more flexible if we want to add or remove countries.
Aggregation
Filtering is great for removing unwanted rows in your data. However, after using filter()
the data set’s unit of analysis typically remains the same. In the code chunk above, for example, we removed most of the countries from our data, but each row in the resulting data set is still a unique country-vote observation.
Let’s say we want to reduce the data set down to the country level, rather than country-vote level, and examine the proportion of “no” votes taken by each country. We can do this in dplyr using the pair of functions group_by()
and summarize()
. Here is an example.
un_votes |>
# Creating a new numeric dummy vote variable
mutate(vote_dummy = case_when(vote == "yes" ~ 1,
vote == "no" ~ 0)) |>
# Specify level of aggregation
group_by(country) |>
# Perform the aggregation function by group
summarize(proportion_yes_vote = mean(vote_dummy, na.rm = TRUE))
# A tibble: 200 × 2
country proportion_yes_vote
<chr> <dbl>
1 Afghanistan 0.938
2 Albania 0.816
3 Algeria 0.965
4 Andorra 0.833
5 Angola 0.985
6 Antigua & Barbuda 0.988
7 Argentina 0.935
8 Armenia 0.950
9 Australia 0.745
10 Austria 0.880
# … with 190 more rows
# ℹ Use `print(n = ...)` to see more rows
The code above uses the function mean()
to calculate the average value of “vote_dummy” within each country. Because “vote_dummy” takes the values 1
and 0
, this average can be interpreted as a proportion of 1’s, or “yes” votes. We need to add na.rm = TRUE
inside mean()
to tell R to ignore NA
values when calculating this average.
If we want to sort our data by the new aggregated column, we can do so using arrange()
after summarize()
.
un_votes |>
mutate(vote_dummy = case_when(vote == "yes" ~ 1,
vote == "no" ~ 0)) |>
group_by(country) |>
summarize(proportion_yes_vote = mean(vote_dummy, na.rm = TRUE)) |>
arrange(proportion_yes_vote)
# A tibble: 200 × 2
country proportion_yes_vote
<chr> <dbl>
1 United States 0.369
2 Israel 0.464
3 Palau 0.541
4 Micronesia (Federated States of) 0.604
5 United Kingdom 0.616
6 Marshall Islands 0.659
7 France 0.662
8 Federal Republic of Germany 0.682
9 Canada 0.701
10 Belgium 0.714
# … with 190 more rows
# ℹ Use `print(n = ...)` to see more rows
# Use arrange(desc(proportion_yes_vote))
# to sort in descending order
The United States votes “No” at a much higher rate than other countries!
Lastly, the dplyr function count()
is very handy for quickly tallying the number of observations within each category of a variable. Simply put the name of the variable you want to count inside count()
and voila.
un_roll_call_issues |>
count(issue)
# A tibble: 6 × 2
issue n
<fct> <int>
1 Colonialism 957
2 Arms control and disarmament 1092
3 Economic development 765
4 Human rights 1015
5 Palestinian conflict 1061
6 Nuclear weapons and nuclear material 855
3.5 Merging Data
So far we have only been working with one data set at a time. But it is rare to find all the data you need for a particular project in one single data set. Frequently we will have to merge data sets together in order to have all our variables in the same place. The most common way to merge data using dplyr is with the left_join()
function.
To illustrate how left_join()
works, let’s say we have two data sets: our main data set, x
and the data set we want to merge, y
. Here is an example:
xy_data <- left_join(x, y, by = "key_variable")
The new data set xy_data
is produced by merging the y
data set into the x
data set based on the shared value of the key_variable
variable. Left joins will always keep all rows in the x
data set, but will only merge in rows from y
if they match up with the key_variable
. Your key variable will typically be something like country ID code or a unit’s name in the data. Because we like to use pipes in this course, the chunk of code above can be rewritten as:
xy_data <- x |>
left_join(y, by = "key_variable")
Now let’s take a look at left_join()
using real data from the unvotes data set.
un_votes |>
left_join(un_roll_calls, by = "rcid")
# A tibble: 869,937 × 12
rcid country count…¹ vote session impor…² date unres amend para
<dbl> <chr> <chr> <fct> <dbl> <int> <date> <chr> <int> <int>
1 3 United Stat… US yes 1 0 1946-01-01 R/1/… 1 0
2 3 Canada CA no 1 0 1946-01-01 R/1/… 1 0
3 3 Cuba CU yes 1 0 1946-01-01 R/1/… 1 0
4 3 Haiti HT yes 1 0 1946-01-01 R/1/… 1 0
5 3 Dominican R… DO yes 1 0 1946-01-01 R/1/… 1 0
6 3 Mexico MX yes 1 0 1946-01-01 R/1/… 1 0
7 3 Guatemala GT yes 1 0 1946-01-01 R/1/… 1 0
8 3 Honduras HN yes 1 0 1946-01-01 R/1/… 1 0
9 3 El Salvador SV yes 1 0 1946-01-01 R/1/… 1 0
10 3 Nicaragua NI yes 1 0 1946-01-01 R/1/… 1 0
# … with 869,927 more rows, 2 more variables: short <chr>, descr <chr>, and
# abbreviated variable names ¹country_code, ²importantvote
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
The code chunk above merged all the variables from the un_roll_calls
data set into the un_votes
data set. Recall that the un_votes
data contains information about how countries voted on particular UN roll call votes, and the un_roll_calls
data contains information about the specific votes (such as whether the vote was an “importantvote”). If there were any rows in un_roll_calls
that had rcid
values which failed to match with an rcid
value in un_votes
, the row would not get merged.
While left_join()
is far and away the most common, and useful, way to merge two data sets when using dplyr, there are various other types of merges which have their niche applications. For a more in-depth look at merging data in dplyr see this page with helpful animations.