13  データの選択

データフレームに含まれるデータのうちで実際に分析に使うのは一部ということはよくあります。その場合は不要なデータを削除していたほうが見やすいですし、PCへの負担も少なくなると思います。

なお、以下ではRの標準関数を使う場合、tidyversedplyrを使う場合、Pythonのpadansを使う場合、polarsを使う場合を紹介します。

R
library(tidyverse)
library(readxl)
Python
import pandas as pd
import polars as pl

13.1 変数の選択

まずは、特定の変数だけを使う場合を考えます。例えば、以前読み込んだPolityのデータから特定の変数だけを取り出します。具体的には、ccodecountryyearpolity2だけを取り出します。

変数の選択

13.1.1 変数名による選択

R (base)
df_polity <- read_excel("data/p5v2018.xls")

df_polity <- df_polity[,c("ccode", "country", "year", "polity2")]

head(df_polity)
# A tibble: 6 × 4
  ccode country      year polity2
  <dbl> <chr>       <dbl>   <dbl>
1   700 Afghanistan  1800      -6
2   700 Afghanistan  1801      -6
3   700 Afghanistan  1802      -6
4   700 Afghanistan  1803      -6
5   700 Afghanistan  1804      -6
6   700 Afghanistan  1805      -6
R (tidyverse)
df_polity <- read_excel("data/p5v2018.xls") |> 
    select(ccode, country, year, polity2)

head(df_polity)
# A tibble: 6 × 4
  ccode country      year polity2
  <dbl> <chr>       <dbl>   <dbl>
1   700 Afghanistan  1800      -6
2   700 Afghanistan  1801      -6
3   700 Afghanistan  1802      -6
4   700 Afghanistan  1803      -6
5   700 Afghanistan  1804      -6
6   700 Afghanistan  1805      -6
Python (pandas)
df_polity = pd.read_excel("data/p5v2018.xls")

df_polity = df_polity.loc[:,["ccode", "country", "year", "polity2"]]
# df_polity = df_polity[["ccode", "country", "year", "polity2"]]でも可

df_polity.head()
   ccode      country  year  polity2
0    700  Afghanistan  1800     -6.0
1    700  Afghanistan  1801     -6.0
2    700  Afghanistan  1802     -6.0
3    700  Afghanistan  1803     -6.0
4    700  Afghanistan  1804     -6.0
Python (polars)
df_polity = (
    pl.read_excel("data/p5v2018.xls")
    .select(["ccode", "country", "year", "polity2"])
)

df_polity.head()
shape: (5, 4)
ccode country year polity2
i64 str i64 i64
700 "Afghanistan" 1800 -6
700 "Afghanistan" 1801 -6
700 "Afghanistan" 1802 -6
700 "Afghanistan" 1803 -6
700 "Afghanistan" 1804 -6

いずれも期待通りの結果が出ています。ところで、上の4つのコードを見比べていると共通性があることが分かります。

  • Rの標準関数およびPythonのpandasでは,の右側で変数名を入力することで変数を選択しています。
    • pandasの場合はloc属性を使っていること、,の左に「全て」を意味する:が入っている点に注意。
  • 一方、RのtidyverseとPythonのpolarsではselect()を使って変数名を選択しています。

13.1.2 番号による選択

あまり使う機会はありませんが、変数の番号を指定して選択することもできます。それぞれ、3, 5, 6, 12番目の変数です。

R (base)
df_polity <- read_excel("data/p5v2018.xls")

df_polity <- df_polity[,c(3, 5, 6, 12)]

head(df_polity)
# A tibble: 6 × 4
  ccode country      year polity2
  <dbl> <chr>       <dbl>   <dbl>
1   700 Afghanistan  1800      -6
2   700 Afghanistan  1801      -6
3   700 Afghanistan  1802      -6
4   700 Afghanistan  1803      -6
5   700 Afghanistan  1804      -6
6   700 Afghanistan  1805      -6
Python (pandas)
df_polity = pd.read_excel("data/p5v2018.xls")

df_polity = df_polity.iloc[:,[2, 4, 5, 11]]

df_polity.head()
   ccode      country  year  polity2
0    700  Afghanistan  1800     -6.0
1    700  Afghanistan  1801     -6.0
2    700  Afghanistan  1802     -6.0
3    700  Afghanistan  1803     -6.0
4    700  Afghanistan  1804     -6.0
  • locの代わりにilocを使います。

Rは1から始まるので番号をそのまま入力すればいいですが、Pythonは0から始まるので1引いた数を入力します。

  • 直観的にはRの方が自然ですが、プログラミングでは0から始まるのが標準的です。

13.2 変数名の変更

変数の選択ではないですが、ついでに変数名を変更する方法を説明します。例えば、世銀のデータではCountry Nameのように空白が変数名に入っていましたが、変数名に空白が入ることは(特にtidyverseを使う上で)不都合があります。変数名を変更したいと思います。

R (tidyverse)
df_pop_fem <- read_csv("data/wb_pop_fem.csv", skip = 4) |> 
    rename(country_name = "Country Name", coutnry_code = "Country Code", indicator_name = "Indicator Name", indicator_code = "Indicator Code")

head(df_pop_fem)
# A tibble: 6 × 69
  country_name   coutnry_code indicator_name indicator_code `1960` `1961` `1962`
  <chr>          <chr>        <chr>          <chr>           <dbl>  <dbl>  <dbl>
1 Aruba          ABW          Population, f… SP.POP.TOTL.F… 2.78e4 2.84e4 2.88e4
2 Africa Easter… AFE          Population, f… SP.POP.TOTL.F… 6.59e7 6.76e7 6.95e7
3 Afghanistan    AFG          Population, f… SP.POP.TOTL.F… 4.15e6 4.23e6 4.33e6
4 Africa Wester… AFW          Population, f… SP.POP.TOTL.F… 4.88e7 4.99e7 5.09e7
5 Angola         AGO          Population, f… SP.POP.TOTL.F… 2.67e6 2.70e6 2.74e6
6 Albania        ALB          Population, f… SP.POP.TOTL.F… 7.85e5 8.09e5 8.34e5
# ℹ 62 more variables: `1963` <dbl>, `1964` <dbl>, `1965` <dbl>, `1966` <dbl>,
#   `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>,
#   `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>,
#   `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
#   `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
#   `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
#   `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, …
Python (pandas)
df_pop_fem = pd.read_csv("data/wb_pop_fem.csv", skiprows=4)
df_pop_fem = df_pop_fem.rename(columns={"Country Name": "country_name", "Country Code": "coutnry_code", "Indicator Name": "indicator_name", "Indicator Code": "indicator_code"})

df_pop_fem.head()
                  country_name coutnry_code  ... 2023 Unnamed: 68
0                        Aruba          ABW  ...  NaN         NaN
1  Africa Eastern and Southern          AFE  ...  NaN         NaN
2                  Afghanistan          AFG  ...  NaN         NaN
3   Africa Western and Central          AFW  ...  NaN         NaN
4                       Angola          AGO  ...  NaN         NaN

[5 rows x 69 columns]
Python (polars)
df_pop_fem = (
    pl.read_csv("data/wb_pop_fem.csv", skip_rows=4)
    .rename({"Country Name": "country_name", "Country Code": "coutnry_code", "Indicator Name": "indicator_name", "Indicator Code": "indicator_code"})
)

df_pop_fem.head()
shape: (5, 69)
country_name coutnry_code indicator_name indicator_code 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023
str str str str i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 str str
"Aruba" "ABW" "Population, fe… "SP.POP.TOTL.FE… 27773 28380 28820 29218 29570 29875 30135 30253 30232 30166 30063 29927 29953 30229 30595 30972 31245 31416 31584 31749 31909 32121 32389 32659 32886 33008 33007 32904 32788 32892 33480 34657 35941 37137 38437 39724 41014 42336 43688 45050 46269 47178 47831 48374 48877 49414 50016 50636 51272 51919 52484 52980 53480 53953 54403 54828 55224 55591 55935 56254 56373 56330 56272 "" null
"Africa Eastern… "AFE" "Population, fe… "SP.POP.TOTL.FE… 65853220 67606287 69457112 71375643 73386167 75478396 77610073 79810945 82111287 84493601 86968714 89504801 92051419 94694181 97478670 100339888 103289004 106237590 109415983 112834021 116060576 119525759 123410049 127333314 131344567 135563206 139816011 144066893 148288335 152522362 156942214 161298074 165609524 170167926 174762745 179486372 184468529 189280003 194009070 198959676 204048614 209257664 214635664 220167814 225898442 231830259 237997868 244435307 251105628 257956460 265000967 272174714 279546577 287224924 295089133 303195897 311387401 319637365 328159112 336930970 345889868 354855221 363834524 "" null
"Afghanistan" "AFG" "Population, fe… "SP.POP.TOTL.FE… 4145945 4233771 4326881 4424511 4526691 4634341 4745981 4861918 4983086 5108507 5239568 5372747 5509781 5655304 5803603 5948268 6083166 6214979 6342838 6373547 6136856 5490160 4973968 4916351 5074600 5225679 5207273 5152650 5188060 5334609 5346409 5372208 6028939 7000119 7722096 8199445 8537421 8871958 9217591 9595036 9727541 9793166 10438055 11247647 11690825 12109086 12614497 12835340 13088192 13557331 13949295 14468875 15067373 15594637 16172321 16682054 17115346 17614722 18136922 18679089 19279930 19844584 20362329 "" null
"Africa Western… "AFW" "Population, fe… "SP.POP.TOTL.FE… 48802898 49850088 50928609 52044907 53196730 54389295 55621877 56890201 58204276 59560501 60963620 62404746 63900687 65482730 67160750 68943269 70786681 72718234 74770813 76909670 79104037 81359426 83714354 85996392 88238093 90605997 93047700 95556172 98139171 100824162 103478502 106184462 109071980 111968903 114896750 117979287 121143186 124399328 127775360 131211380 134795501 138546839 142408066 146370538 150463219 154696476 159035017 163481052 168058833 172782717 177645233 182657978 187755307 192900081 198163527 203513873 208980433 214578994 220253839 226004857 231877590 237813580 243821774 "" null
"Angola" "AGO" "Population, fe… "SP.POP.TOTL.FE… 2670229 2704394 2742689 2779473 2812590 2838939 2856740 2867926 2879001 2902120 2953347 3032948 3132441 3244749 3362438 3483416 3606782 3735823 3872130 4014347 4164145 4321167 4485276 4656894 4834820 5018620 5206761 5396035 5588733 5787505 5991207 6199060 6408303 6621767 6845622 7077381 7315200 7561436 7813123 8071413 8339311 8619083 8912191 9219638 9545020 9886765 10244381 10620174 11013001 11422969 11853530 12303450 12770743 13252938 13746371 14248799 14764575 15293335 15828040 16370553 16910989 17452283 17998220 "" null

いずれもrename()を使って元の変数名と新しい変数名の対応関係を入力します。

  • tidyverseの場合は新変数名 = 旧変数名ですが、pandas/polarsの場合は旧変数名: 新変数名です。
  • tidyverseselect()の中でも変数名を変更することができます。

13.3 変数の除外

特定の変数を選択するのではなく、除外したいときがあるかもしれません。例えば、SIPRIの軍事費データのNotesは不要な気がします。

tidyverseではselect()の際に!を付けると変数を除外します。

R (readxl)
library(readxl)

df_sipri <- read_excel("data/SIPRI-Milex-data-1949-2022.xlsx", sheet = 5, skip = 5, na = c("xxx", "...")) |> 
    select(!Notes)

head(df_sipri)
# A tibble: 6 × 76
  Country   ...2  `1949.0` `1950.0` `1951.0` `1952.0` `1953.0` `1954.0` `1955.0`
  <chr>     <lgl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 <NA>      NA          NA       NA       NA       NA       NA       NA       NA
2 Africa    NA          NA       NA       NA       NA       NA       NA       NA
3 North Af… NA          NA       NA       NA       NA       NA       NA       NA
4 Algeria   NA          NA       NA       NA       NA       NA       NA       NA
5 Libya     NA          NA       NA       NA       NA       NA       NA       NA
6 Morocco   NA          NA       NA       NA       NA       NA       NA       NA
# ℹ 67 more variables: `1956.0` <dbl>, `1957.0` <dbl>, `1958.0` <dbl>,
#   `1959.0` <dbl>, `1960.0` <dbl>, `1961.0` <dbl>, `1962.0` <dbl>,
#   `1963.0` <dbl>, `1964.0` <dbl>, `1965.0` <dbl>, `1966.0` <dbl>,
#   `1967.0` <dbl>, `1968.0` <dbl>, `1969.0` <dbl>, `1970.0` <dbl>,
#   `1971.0` <dbl>, `1972.0` <dbl>, `1973.0` <dbl>, `1974.0` <dbl>,
#   `1975.0` <dbl>, `1976.0` <dbl>, `1977.0` <dbl>, `1978.0` <dbl>,
#   `1979.0` <dbl>, `1980.0` <dbl>, `1981.0` <dbl>, `1982.0` <dbl>, …

pandaspolarsdrop()メソッドで変数を除外します。

Python (pandas)
df_sipri = pd.read_excel("data/SIPRI-Milex-data-1949-2022.xlsx", sheet_name=5, skiprows=5, na_values=["xxx", "..."])
df_sipri = df_sipri.drop(columns=["Notes"])

df_sipri.head()
        Country  1949  1950  ...        2020         2021         2022
0           NaN   NaN   NaN  ...         NaN          NaN          NaN
1        Africa   NaN   NaN  ...         NaN          NaN          NaN
2  North Africa   NaN   NaN  ...         NaN          NaN          NaN
3       Algeria   NaN   NaN  ...  9708.27744  9112.461105  9145.810174
4         Libya   NaN   NaN  ...         NaN          NaN          NaN

[5 rows x 75 columns]
Python (polars)
df_sipri = (
    pl.read_excel("data/SIPRI-Milex-data-1949-2022.xlsx", sheet_id=5, read_options={"skip_rows": 5, "null_values": ["xxx", "..."]})
    .drop(["Notes"])
)

df_sipri.head()
shape: (5, 75)
Country 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
str f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64
"Africa" null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null
"North Africa" null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null
"Algeria" null null null null null null null null null null null null null null null null null null null null 234.875985 219.434678 215.133118 208.22094 215.742463 413.599801 461.022683 642.532765 560.729351 607.406682 600.773471 683.327939 607.328128 701.198007 704.831389 661.921863 620.079659 610.18241 622.034094 615.542127 601.651385 642.72115 657.986141 1101.032432 1183.864801 1440.243493 1395.429481 1588.840569 1910.994943 2021.120506 2133.027701 2475.100274 2709.025608 2768.30774 2708.645105 3080.937162 3225.387723 3306.186349 3879.058547 4519.638056 4908.797561 5194.999271 7434.910242 7823.704814 8450.873651 9732.621963 10182.147187 10212.538248 9671.777933 9275.706188 10006.914126 9773.554762 9112.461105 8776.364725
"Libya" null null null null null null null null null null null null null null null 62.239411 75.739505 135.605176 171.736926 205.401564 280.364831 501.713486 517.782969 399.804189 496.86425 660.244648 403.379454 433.595285 599.712512 705.411019 877.90173 1006.758017 469.762633 542.251381 null null null null null null null null null null null null null null 528.148467 595.753213 460.005767 492.339926 481.663213 619.030119 770.484367 1006.128971 991.116581 872.043438 820.739134 1240.400997 null null null 2691.266837 3509.174752 3246.886855 null null null null null null null null
"Morocco" null null null null null null null 119.561947 174.29616 231.988959 270.419973 269.797363 307.693292 326.567713 429.706446 386.061343 336.729958 365.821285 381.660589 491.01178 445.838272 460.68119 552.386269 610.999174 694.413605 818.31678 1201.617167 1686.646799 1934.233958 1722.805411 1726.670724 1987.294925 2025.512311 2098.786842 1527.477606 1372.987541 1966.91535 1721.964873 1722.974496 1859.410193 2010.669934 1974.55245 2073.654577 2057.185887 2170.432946 2228.159258 2052.130321 2046.648979 2143.497749 2169.0585 1837.788653 1390.760784 2516.231872 2394.040085 2535.872578 2464.704546 2557.771063 2582.189542 2659.230192 2966.11034 3168.016273 3390.448538 3415.165311 3660.78936 4182.250761 4146.897151 3828.72905 3852.059058 3965.368975 3993.067756 4076.773137 5225.571603 5378.366535 5368.334058

13.4 観察の選択

これまでは変数の操作について見てきましたが、次は観察の選択を学びます。つまり、データに含まれる観察の一部だけを取り出して分析する状況を考えます。

観察の選択

13.4.1 不等号による選択

例えば、Polityのデータのうち、第2次世界大戦後のデータを取り出したいとします。

R (base)
df_polity <- read_excel("data/p5v2018.xls")

df_polity[df_polity$year > 1945,]
# A tibble: 9,952 × 37
      p5   cyear ccode scode country      year  flag fragment democ autoc polity
   <dbl>   <dbl> <dbl> <chr> <chr>       <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl>
 1     0 7001946   700 AFG   Afghanistan  1946     0       NA     0    10    -10
 2     0 7001947   700 AFG   Afghanistan  1947     0       NA     0    10    -10
 3     0 7001948   700 AFG   Afghanistan  1948     0       NA     0    10    -10
 4     0 7001949   700 AFG   Afghanistan  1949     0       NA     0    10    -10
 5     0 7001950   700 AFG   Afghanistan  1950     0       NA     0    10    -10
 6     0 7001951   700 AFG   Afghanistan  1951     0       NA     0    10    -10
 7     0 7001952   700 AFG   Afghanistan  1952     0       NA     0    10    -10
 8     0 7001953   700 AFG   Afghanistan  1953     0       NA     0    10    -10
 9     0 7001954   700 AFG   Afghanistan  1954     0       NA     0    10    -10
10     0 7001955   700 AFG   Afghanistan  1955     0       NA     0    10    -10
# ℹ 9,942 more rows
# ℹ 26 more variables: polity2 <dbl>, durable <dbl>, xrreg <dbl>, xrcomp <dbl>,
#   xropen <dbl>, xconst <dbl>, parreg <dbl>, parcomp <dbl>, exrec <dbl>,
#   exconst <dbl>, polcomp <dbl>, prior <dbl>, emonth <dbl>, eday <dbl>,
#   eyear <dbl>, eprec <dbl>, interim <dbl>, bmonth <dbl>, bday <dbl>,
#   byear <dbl>, bprec <dbl>, post <dbl>, change <dbl>, d5 <dbl>, sf <dbl>,
#   regtrans <dbl>
head(df_polity)
# A tibble: 6 × 37
     p5   cyear ccode scode country      year  flag fragment democ autoc polity
  <dbl>   <dbl> <dbl> <chr> <chr>       <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl>
1     0 7001800   700 AFG   Afghanistan  1800     0       NA     1     7     -6
2     0 7001801   700 AFG   Afghanistan  1801     0       NA     1     7     -6
3     0 7001802   700 AFG   Afghanistan  1802     0       NA     1     7     -6
4     0 7001803   700 AFG   Afghanistan  1803     0       NA     1     7     -6
5     0 7001804   700 AFG   Afghanistan  1804     0       NA     1     7     -6
6     0 7001805   700 AFG   Afghanistan  1805     0       NA     1     7     -6
# ℹ 26 more variables: polity2 <dbl>, durable <dbl>, xrreg <dbl>, xrcomp <dbl>,
#   xropen <dbl>, xconst <dbl>, parreg <dbl>, parcomp <dbl>, exrec <dbl>,
#   exconst <dbl>, polcomp <dbl>, prior <dbl>, emonth <dbl>, eday <dbl>,
#   eyear <dbl>, eprec <dbl>, interim <dbl>, bmonth <dbl>, bday <dbl>,
#   byear <dbl>, bprec <dbl>, post <dbl>, change <dbl>, d5 <dbl>, sf <dbl>,
#   regtrans <dbl>
R (tidyverse)
df_polity <- read_excel("data/p5v2018.xls") |> 
    filter(year > 1945)

head(df_polity)
# A tibble: 6 × 37
     p5   cyear ccode scode country      year  flag fragment democ autoc polity
  <dbl>   <dbl> <dbl> <chr> <chr>       <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl>
1     0 7001946   700 AFG   Afghanistan  1946     0       NA     0    10    -10
2     0 7001947   700 AFG   Afghanistan  1947     0       NA     0    10    -10
3     0 7001948   700 AFG   Afghanistan  1948     0       NA     0    10    -10
4     0 7001949   700 AFG   Afghanistan  1949     0       NA     0    10    -10
5     0 7001950   700 AFG   Afghanistan  1950     0       NA     0    10    -10
6     0 7001951   700 AFG   Afghanistan  1951     0       NA     0    10    -10
# ℹ 26 more variables: polity2 <dbl>, durable <dbl>, xrreg <dbl>, xrcomp <dbl>,
#   xropen <dbl>, xconst <dbl>, parreg <dbl>, parcomp <dbl>, exrec <dbl>,
#   exconst <dbl>, polcomp <dbl>, prior <dbl>, emonth <dbl>, eday <dbl>,
#   eyear <dbl>, eprec <dbl>, interim <dbl>, bmonth <dbl>, bday <dbl>,
#   byear <dbl>, bprec <dbl>, post <dbl>, change <dbl>, d5 <dbl>, sf <dbl>,
#   regtrans <dbl>
Python (pandas)
df_polity = pd.read_excel("data/p5v2018.xls")

df_polity = df_polity.loc[df_polity["year"] > 1945,:]
# df_polity = df_polity[df_polity["year"] > 1945]でも可

df_polity.head()
     p5    cyear  ccode scode      country  ...  post  change  d5  sf  regtrans
146   0  7001946    700   AFG  Afghanistan  ...   NaN     NaN NaN NaN       NaN
147   0  7001947    700   AFG  Afghanistan  ...   NaN     NaN NaN NaN       NaN
148   0  7001948    700   AFG  Afghanistan  ...   NaN     NaN NaN NaN       NaN
149   0  7001949    700   AFG  Afghanistan  ...   NaN     NaN NaN NaN       NaN
150   0  7001950    700   AFG  Afghanistan  ...   NaN     NaN NaN NaN       NaN

[5 rows x 37 columns]
Python (polars)
df_polity = (
    pl.read_excel("data/p5v2018.xls")
    .filter(pl.col("year") > 1945)
)

df_polity.head()
shape: (5, 37)
p5 cyear ccode scode country year flag fragment democ autoc polity polity2 durable xrreg xrcomp xropen xconst parreg parcomp exrec exconst polcomp prior emonth eday eyear eprec interim bmonth bday byear bprec post change d5 sf regtrans
i64 i64 i64 str str i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64
0 7001946 700 "AFG" "Afghanistan" 1946 0 null 0 10 -10 -10 null 3 1 1 1 4 1 1 1 1 null null null null null null null null null null null null null null null
0 7001947 700 "AFG" "Afghanistan" 1947 0 null 0 10 -10 -10 null 3 1 1 1 4 1 1 1 1 null null null null null null null null null null null null null null null
0 7001948 700 "AFG" "Afghanistan" 1948 0 null 0 10 -10 -10 null 3 1 1 1 4 1 1 1 1 null null null null null null null null null null null null null null null
0 7001949 700 "AFG" "Afghanistan" 1949 0 null 0 10 -10 -10 null 3 1 1 1 4 1 1 1 1 null null null null null null null null null null null null null null null
0 7001950 700 "AFG" "Afghanistan" 1950 0 null 0 10 -10 -10 null 3 1 1 1 4 1 1 1 1 null null null null null null null null null null null null null null null

ここでも共通点が見えてきます。tidyversepolarsではfilter()の中に条件式を入れます。一方、Rの標準関数とpandasでは,の左側に条件式を入れます。

  • <>は通常の意味通り、使うことができます。
  • 等号も含む場合は<=>=とします。

13.4.2 等号による選択

次に、日本だけのデータを取り出したいと思います。=ではなく===が2つ)である点に注意してください。

R (base)
df_polity <- read_excel("data/p5v2018.xls")

df_polity[df_polity$country == "Japan",]
# A tibble: 219 × 37
      p5   cyear ccode scode country  year  flag fragment democ autoc polity
   <dbl>   <dbl> <dbl> <chr> <chr>   <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl>
 1     0 7401800   740 JPN   Japan    1800     0       NA     0    10    -10
 2     0 7401801   740 JPN   Japan    1801     0       NA     0    10    -10
 3     0 7401802   740 JPN   Japan    1802     0       NA     0    10    -10
 4     0 7401803   740 JPN   Japan    1803     0       NA     0    10    -10
 5     0 7401804   740 JPN   Japan    1804     0       NA     0    10    -10
 6     0 7401805   740 JPN   Japan    1805     0       NA     0    10    -10
 7     0 7401806   740 JPN   Japan    1806     0       NA     0    10    -10
 8     0 7401807   740 JPN   Japan    1807     0       NA     0    10    -10
 9     0 7401808   740 JPN   Japan    1808     0       NA     0    10    -10
10     0 7401809   740 JPN   Japan    1809     0       NA     0    10    -10
# ℹ 209 more rows
# ℹ 26 more variables: polity2 <dbl>, durable <dbl>, xrreg <dbl>, xrcomp <dbl>,
#   xropen <dbl>, xconst <dbl>, parreg <dbl>, parcomp <dbl>, exrec <dbl>,
#   exconst <dbl>, polcomp <dbl>, prior <dbl>, emonth <dbl>, eday <dbl>,
#   eyear <dbl>, eprec <dbl>, interim <dbl>, bmonth <dbl>, bday <dbl>,
#   byear <dbl>, bprec <dbl>, post <dbl>, change <dbl>, d5 <dbl>, sf <dbl>,
#   regtrans <dbl>
head(df_polity)
# A tibble: 6 × 37
     p5   cyear ccode scode country      year  flag fragment democ autoc polity
  <dbl>   <dbl> <dbl> <chr> <chr>       <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl>
1     0 7001800   700 AFG   Afghanistan  1800     0       NA     1     7     -6
2     0 7001801   700 AFG   Afghanistan  1801     0       NA     1     7     -6
3     0 7001802   700 AFG   Afghanistan  1802     0       NA     1     7     -6
4     0 7001803   700 AFG   Afghanistan  1803     0       NA     1     7     -6
5     0 7001804   700 AFG   Afghanistan  1804     0       NA     1     7     -6
6     0 7001805   700 AFG   Afghanistan  1805     0       NA     1     7     -6
# ℹ 26 more variables: polity2 <dbl>, durable <dbl>, xrreg <dbl>, xrcomp <dbl>,
#   xropen <dbl>, xconst <dbl>, parreg <dbl>, parcomp <dbl>, exrec <dbl>,
#   exconst <dbl>, polcomp <dbl>, prior <dbl>, emonth <dbl>, eday <dbl>,
#   eyear <dbl>, eprec <dbl>, interim <dbl>, bmonth <dbl>, bday <dbl>,
#   byear <dbl>, bprec <dbl>, post <dbl>, change <dbl>, d5 <dbl>, sf <dbl>,
#   regtrans <dbl>
R (tidyverse)
df_polity <- read_excel("data/p5v2018.xls") |> 
    filter(country == "Japan")

head(df_polity)
# A tibble: 6 × 37
     p5   cyear ccode scode country  year  flag fragment democ autoc polity
  <dbl>   <dbl> <dbl> <chr> <chr>   <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl>
1     0 7401800   740 JPN   Japan    1800     0       NA     0    10    -10
2     0 7401801   740 JPN   Japan    1801     0       NA     0    10    -10
3     0 7401802   740 JPN   Japan    1802     0       NA     0    10    -10
4     0 7401803   740 JPN   Japan    1803     0       NA     0    10    -10
5     0 7401804   740 JPN   Japan    1804     0       NA     0    10    -10
6     0 7401805   740 JPN   Japan    1805     0       NA     0    10    -10
# ℹ 26 more variables: polity2 <dbl>, durable <dbl>, xrreg <dbl>, xrcomp <dbl>,
#   xropen <dbl>, xconst <dbl>, parreg <dbl>, parcomp <dbl>, exrec <dbl>,
#   exconst <dbl>, polcomp <dbl>, prior <dbl>, emonth <dbl>, eday <dbl>,
#   eyear <dbl>, eprec <dbl>, interim <dbl>, bmonth <dbl>, bday <dbl>,
#   byear <dbl>, bprec <dbl>, post <dbl>, change <dbl>, d5 <dbl>, sf <dbl>,
#   regtrans <dbl>
Python (pandas)
df_polity = pd.read_excel("data/p5v2018.xls")

df_polity = df_polity.loc[df_polity["country"] == "Japan",:]

df_polity.head()
      p5    cyear  ccode scode country  ...  post  change   d5  sf  regtrans
7921   0  7401800    740   JPN   Japan  ... -10.0    88.0  1.0 NaN       NaN
7922   0  7401801    740   JPN   Japan  ...   NaN     NaN  NaN NaN       NaN
7923   0  7401802    740   JPN   Japan  ...   NaN     NaN  NaN NaN       NaN
7924   0  7401803    740   JPN   Japan  ...   NaN     NaN  NaN NaN       NaN
7925   0  7401804    740   JPN   Japan  ...   NaN     NaN  NaN NaN       NaN

[5 rows x 37 columns]
Python (polars)
df_polity = (
    pl.read_excel("data/p5v2018.xls")
    .filter(pl.col("country") == "Japan")
)

df_polity.head()
shape: (5, 37)
p5 cyear ccode scode country year flag fragment democ autoc polity polity2 durable xrreg xrcomp xropen xconst parreg parcomp exrec exconst polcomp prior emonth eday eyear eprec interim bmonth bday byear bprec post change d5 sf regtrans
i64 i64 i64 str str i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64
0 7401800 740 "JPN" "Japan" 1800 0 null 0 10 -10 -10 null 3 1 1 1 4 1 1 1 1 null null null null null null 1 1 1800 1 -10 88 1 null null
0 7401801 740 "JPN" "Japan" 1801 0 null 0 10 -10 -10 null 3 1 1 1 4 1 1 1 1 null null null null null null null null null null null null null null null
0 7401802 740 "JPN" "Japan" 1802 0 null 0 10 -10 -10 null 3 1 1 1 4 1 1 1 1 null null null null null null null null null null null null null null null
0 7401803 740 "JPN" "Japan" 1803 0 null 0 10 -10 -10 null 3 1 1 1 4 1 1 1 1 null null null null null null null null null null null null null null null
0 7401804 740 "JPN" "Japan" 1804 0 null 0 10 -10 -10 null 3 1 1 1 4 1 1 1 1 null null null null null null null null null null null null null null null

13.4.3 部分集合による選択

第2次世界大戦後から冷戦終結までのデータを取り出したい場合は、「かつ (and)」を意味する&を使います。

R (base)
df_polity <- read_excel("data/p5v2018.xls")

df_polity <- df_polity[df_polity$year > 1945 & df_polity$year <= 1991,]
R (tidyverse)
df_polity <- read_excel("data/p5v2018.xls") |> 
    filter(year > 1945 & year <= 1991)
Python (pandas)
df_polity = pd.read_excel("data/p5v2018.xls")

df_polity = df_polity.loc[(df_polity["year"] > 1945) & (df_polity["year"] <= 1991),:]
Python (polars)
df_polity = (
    pl.read_excel("data/p5v2018.xls")
    .filter((pl.col("year") > 1945) & (pl.col("year") < 1991))
)
  • Pythonではそれぞれの条件部分を()でくくる必要があります。

なお、次のように範囲を指定することもできます。

R (tidyverse)
df_polity <- read_excel("data/p5v2018.xls") |> 
    filter(between(year, 1946, 1991))
Python (pandas)
df_polity = pd.read_excel("data/p5v2018.xls")

df_polity = df_polity.loc[df_polity["year"].between(1946, 1991),:]
Python (polars)
df_polity = (
    pl.read_excel("data/p5v2018.xls")
    .filter(pl.col("year").is_between(1946, 1991))
)

13.4.4 和集合による選択

日本と中国のデータを取り出す場合は、「または (or)」を意味する|を使います。

R (base)
df_polity <- read_excel("data/p5v2018.xls")

df_polity <- df_polity[df_polity$country == "Japan" | df_polity$country == "China",]
R (tidyverse)
df_polity <- read_excel("data/p5v2018.xls") |> 
    filter(country == "Japan" | country == "China")
Python (pandas)
df_polity = pd.read_excel("data/p5v2018.xls")

df_polity = df_polity.loc[(df_polity["country"] == "Japan") | (df_polity["country"] == "China"),:]
Python (polars)
df_polity = (
    pl.read_excel("data/p5v2018.xls")
    .filter((pl.col("country") == "Japan") | (pl.col("country") == "China"))
)

次のように簡単に書くこともできます。

R (base)
df_polity <- read_excel("data/p5v2018.xls")

df_polity <- df_polity[df_polity$country %in% c("Japan", "China"),]
R (tidyverse)
df_polity <- read_excel("data/p5v2018.xls") |> 
    filter(country %in% c("Japan", "China"))
Python (pandas)
df_polity = pd.read_excel("data/p5v2018.xls")

df_polity = df_polity.loc[df_polity["country"].isin(["Japan", "China"]),:]
Python (polars)
df_polity = (
    pl.read_excel("data/p5v2018.xls")
    .filter(pl.col("country").is_in(["Japan", "China"]))
)

なお、否定を表す場合はRの場合は!を、Pythonの場合は~を条件式の冒頭に付けます。

  • RでもPythonでも==の否定は!=でも構いません。

13.4.5 ユニークなデータの選択

例えば、データフレームに含まれる国名をデータフレームにしたい場合、単に国名の変数だけを取り出すと、いろいろな年のデータも含まれるので重複が生じます。このような場合は、次のようにします。

R (tidyverse)
df_polity <- read_excel("data/p5v2018.xls") |> 
    distinct(country)

head(df_polity)
# A tibble: 6 × 1
  country    
  <chr>      
1 Afghanistan
2 Albania    
3 Algeria    
4 Angola     
5 Argentina  
6 Armenia    
Python (pandas)
df_polity = pd.read_excel("data/p5v2018.xls")

df_polity = df_polity.loc[:,"country"].drop_duplicates()

df_polity.head()
0      Afghanistan
219        Albania
324        Algeria
381         Angola
425      Argentina
Name: country, dtype: object
Python (polars)
df_polity = (
    pl.read_excel("data/p5v2018.xls")
    .select("country")
    .unique(maintain_order=True)
)

df_polity.head()
shape: (5, 1)
country
str
"Afghanistan"
"Albania"
"Algeria"
"Angola"
"Argentina"