Officedown: Bookmark \@ref fails to work for multibyte strings

Created on 27 Aug 2020  ·  7 Comments  ·  Source: davidgohel/officedown

Suppose I have a .Rmd file like below:

    ---
    title: "Untitled"
    output:
      officedown::rdocx_document:
        default
    ---
    ```{r setup, include=FALSE}
    knitr::opts_chunk$set(echo = TRUE)
    ```

    # Chapter1 {#ch1}

    # Chapter2 {#ch2}

    Refer to \@ref(ch1).

When \@ref(ch1) is adjacent to multibyte text (e.g., Chinese characters), the reference may not be resolved, or the document may fail to compile.

  • Pure multibyte + ref
    • Example: 上下\@ref(ch1)
    • Result: correct
  • Mixed multibyte/single-byte + ref
    • Example: 上a下\@ref(ch1)
    • Result: incorrect (rendered literally as 上a下@ref(ch1))
  • ref + multibyte
    • Example: \@ref(ch1)。
    • Result: compilation fails with the error below

    Error in nchar(u, itype) : invalid multibyte string, element 1
    Calls: ... regmatches<- -> regmatches -> Map -> mapply ->

Can you please look into this issue? Thanks.


sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 20180)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936
[2] LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.936

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
[1] officer_0.3.12 officedown_0.2.0 flextable_0.5.10
[4] ggplot2_3.3.2 tidyr_1.1.1 knitr_1.29
[7] dplyr_1.0.2 reticulate_1.16

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 lattice_0.20-41 prettyunits_1.1.1
[4] sysfonts_0.8.1 ps_1.3.4 utf8_1.1.4
[7] rprojroot_1.3-2 assertthat_0.2.1 digest_0.6.25
[10] R6_2.4.1 backports_1.1.9 evaluate_0.14
[13] pillar_1.4.6 gdtools_0.2.2 rlang_0.4.7
[16] curl_4.3 uuid_0.1-4 data.table_1.13.0
[19] callr_3.4.3 Matrix_1.2-18 rmarkdown_2.3
[22] desc_1.2.0 labeling_0.3 devtools_2.3.1
[25] stringr_1.4.0 munsell_0.5.0 tinytex_0.25
[28] compiler_4.0.2 xfun_0.16 pkgconfig_2.0.3
[31] systemfonts_0.2.3 base64enc_0.1-3 pkgbuild_1.1.0
[34] rvg_0.2.5 htmltools_0.5.0 tidyselect_1.1.0
[37] tibble_3.0.3 bookdown_0.20 fansi_0.4.1
[40] crayon_1.3.4 showtextdb_3.0 withr_2.2.0
[43] grid_4.0.2 jsonlite_1.7.0 gtable_0.3.0
[46] lifecycle_0.2.0 magrittr_1.5 scales_1.1.1
[49] zip_2.1.0 cli_2.0.2 stringi_1.4.6
[52] farver_2.0.3 fs_1.5.0 remotes_2.2.0
[55] testthat_2.3.2 xml2_1.3.2 ellipsis_0.3.1
[58] generics_0.0.2 vctrs_0.3.2 tools_4.0.2
[61] showtext_0.9 glue_1.4.1 purrr_0.3.4
[64] processx_3.4.3 pkgload_1.1.0 yaml_2.2.1
[67] colorspace_1.4-1 sessioninfo_1.1.1 memoise_1.1.0
[70] usethis_1.6.1

bug

All 7 comments


Your issue comes from the fact that you are not working with a UTF-8 encoded file.

R and R Markdown do not work well on Windows when the encoding is not UTF-8.

[Screenshot, 2020-08-27 10:53:27]

[Attachment: Untitled.docx]
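To make the encoding point concrete, here is a small base R sketch. It only illustrates that the same characters have different byte sequences in UTF-8 and GBK, and that the GBK bytes are not valid UTF-8; it is not a trace of what officedown does internally. It assumes the platform's iconv supports "GBK".

```r
# The two characters from the examples above, written with Unicode escapes
# so the script itself does not depend on the file encoding.
x <- "\u4e0a\u4e0b"

# Byte representations in UTF-8 and in GBK (assumes iconv knows "GBK").
utf8_bytes <- charToRaw(enc2utf8(x))
gbk_bytes  <- iconv(x, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]

print(utf8_bytes)  # 3 bytes per character in UTF-8
print(gbk_bytes)   # 2 bytes per character in GBK

# The GBK byte sequence is not valid UTF-8, so code that expects UTF-8
# (or runs in a locale that disagrees with the bytes) can fail on it.
gbk_string <- rawToChar(gbk_bytes)
validUTF8(enc2utf8(x))   # TRUE
validUTF8(gbk_string)    # FALSE

# Whether the current session runs in a UTF-8 locale:
l10n_info()[["UTF-8"]]
```

On a Windows session like the one in the sessionInfo() output (code page 936), the last line would be expected to return FALSE.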

Yes, @davidgohel, you are right. Although the .Rmd file is in UTF-8, the OS uses the GBK encoding. When I switch to bookdown::word_document2, knitr manages to compile the file, but I still get ?? where the bookmark is supposed to appear.

You don't need to switch to a different output format function.

The result I posted was produced on Windows with a French locale, but I made sure the file was encoded as UTF-8. I check with readr::guess_encoding(); if the file is not UTF-8 encoded, I convert it to UTF-8 with fpeek::peek_iconv().

Could you show the result of:

readr::guess_encoding("your/rmd/file")
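For readers without readr or fpeek installed, a rough base R counterpart of the check and conversion mentioned above might look like the sketch below. The helper names `is_utf8_file` and `convert_to_utf8` are hypothetical, and the source encoding "GBK" is an assumption that fits this thread; adjust it to your own system.

```r
# Hypothetical helper: TRUE if every line of the file is valid UTF-8.
is_utf8_file <- function(path) {
  lines <- readLines(path, warn = FALSE)   # bytes are read as-is, no re-encoding
  all(validUTF8(lines))
}

# Hypothetical helper: re-encode a file from `from` (assumed "GBK") to UTF-8.
convert_to_utf8 <- function(path, out, from = "GBK") {
  lines <- readLines(path, warn = FALSE)
  utf8  <- iconv(lines, from = from, to = "UTF-8")
  if (anyNA(utf8)) stop("some lines could not be converted from ", from)
  con <- file(out, open = "wb")
  on.exit(close(con))
  writeLines(utf8, con, useBytes = TRUE)   # write the UTF-8 bytes untouched
  invisible(out)
}

# Usage: create a small GBK-encoded file, then detect and convert it.
src <- tempfile(fileext = ".Rmd")
dst <- tempfile(fileext = ".Rmd")
writeBin(iconv("\u4e0a\u4e0b", "UTF-8", "GBK", toRaw = TRUE)[[1]], src)

is_utf8_file(src)            # FALSE
convert_to_utf8(src, dst)
is_utf8_file(dst)            # TRUE
```

Note that a validity check is weaker than the statistical guess readr::guess_encoding() makes: a pure ASCII file passes it too, which is harmless here because ASCII is a subset of UTF-8.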

The results are

no | encoding | confidence
---|-------------|-----------:
1 | UTF-8 | 1
2 | windows-1252 | 0.28

Hi @madlogos,

I am also a Chinese user, and the multibyte problem has bothered me for a long time. Here is my workaround:

  1. Write \@ref as usual;
  2. Save the Rmd file and read it with readr::read_lines;
  3. Find the strings that match the pattern "\\\\@ref\\([^\\)]+\\)";
  4. Split them so that each "\\\\@ref\\([^\\)]+\\)" match sits on a line of its own;
  5. Save the character vector to a new Rmd file and render it with the format you like. Done!

For example, 请参考表\@ref(tab: coco)中的数据 should be split into
[line 1] 请参考表
[line 2] \@ref(tab: coco)
[line 3] 中的数据

Well, I am not sure whether this is a robust solution, but it works for me. 😄
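The five steps above could be sketched in base R roughly as follows. `split_refs` and `split_refs_file` are hypothetical helper names, not part of officedown or bookdown, and base `readLines` stands in for readr::read_lines. The sketch assumes the script and the Rmd file are both saved as UTF-8.

```r
# Hypothetical helper: put every \@ref(...) on a line of its own.
split_refs <- function(lines) {
  pattern <- "\\\\@ref\\([^)]+\\)"
  pieces <- lapply(lines, function(line) {
    if (!grepl(pattern, line, perl = TRUE)) return(line)
    # Surround each reference with newlines, then split on them.
    marked <- gsub(paste0("(", pattern, ")"), "\n\\1\n", line, perl = TRUE)
    parts <- strsplit(marked, "\n", fixed = TRUE)[[1]]
    parts[nzchar(parts)]   # drop empty pieces at line start or between refs
  })
  unlist(pieces, use.names = FALSE)
}

# Hypothetical wrapper: read an Rmd as UTF-8, split it, write a new Rmd.
split_refs_file <- function(input, output) {
  lines <- readLines(input, encoding = "UTF-8", warn = FALSE)
  con <- file(output, open = "wb")
  on.exit(close(con))
  writeLines(enc2utf8(split_refs(lines)), con, useBytes = TRUE)
  invisible(output)
}

split_refs("请参考表\\@ref(tab: coco)中的数据")
# -> c("请参考表", "\\@ref(tab: coco)", "中的数据")
```

The new file can then be rendered as usual. One caveat: Pandoc normally treats a single newline inside a paragraph as a soft break, so the spacing around the reference in the rendered document may differ slightly from the original.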

@bishun945 thank you for the workaround. Good stuff.

@madlogos I have tried another solution: just switch your system and MS Word language to English.
